[Feat]: Deadman alerts for Parent/Child Netdata setups - detect loss of streaming #18436

dogsbody-josh · 2024-08-29T14:57:34Z

Problem

The problem to be solved is alerting on the loss of streaming metrics from a child node to a parent node. This request deals specifically with the use case of a child that stops streaming to a parent.

We will use the term 'Deadman Alerts' throughout this request to describe the detection of a loss of streaming from a child to a parent node. This is to distinguish the functionality from the existing Reachability Alerts Netdata Cloud provides.

On that topic, Netdata Cloud has Reachability Alerts, but they are confined to the Cloud option. At the moment they are largely un-configurable and no specific option exists to track or alert on Parent/Child communication loss. Issue 1039 exists to address that but doesn't go as far as this Feature Request in bringing the functionality into the Netdata Agent.

The problems with cloud based Reachability Alerts are:

Excessive alerts during upgrades and from ephemeral nodes or nodes that are turned off and on to a schedule.
No non-cloud based 'in-built' option to detect the loss of streaming from a child to a parent.
No ability to detect partial loss of metric data, for instance if a service like Apache stops sending metrics but the remaining metrics continue to be sent.
The lack of configurability of Cloud Reachability alerts is also a blocker to getting effective notifications of potential streaming problems to reach the right teams in the right way.

The Netdata Agent can extensively configure notification options, roles, and recipients across a wide range of notification channels. No such ability exists with Cloud Reachability alerts. Integrating this new 'Deadman Alert' functionality with the existing notification system would provide extensive notification options and ensure teams are alerted to problems right away.

Finally, because Reachability Alerts are either on or off for all nodes in a room, some ephemeral nodes have to be organised differently just to work around this limitation. Integrating Deadman Alerts into the agent itself and giving the parent control over alerting for a loss of streaming would remove it.

Overall, the problem being solved here is multi-faceted:

Provide 'Deadman Alerts' for a child that stops streaming all or part of its data to a parent
Do so without requiring cloud configuration
Provide greater configurability to 'Deadman Alert' notifications using the current notification system

Description

These 'in-built' non-cloud Deadman Alerts need the following functionality:

1. Configurable per individual node via the Parent. All the points below should be configurable in this way too.
2. Conditions for triggering the alert should be configurable. Configuration options should include (at least) a customisable delay before triggering (to account for patching/Netdata Agent restarts) and the ability to specify all or partial streaming data loss.
3. Control over the content of the notification, including subject and body, via the existing notification system - Agent Dispatched Notifications (Roles and alternative notification methods).
4. Ability to silence alerts per node or for a custom, selectable set of nodes.
5. Silences should be schedulable and support recurrence, so that nodes that are switched off overnight or during some other recurring period can be silenced without causing inappropriate alerts.

Configuration of this functionality should be centrally manageable by the 'parent' node for the setup. This is particularly important for Business customers like MSPs, where a single main parent may collect streaming data from hundreds of child nodes.

In these cases we shouldn't have to configure Deadman Alerts individually per child; they should be managed in one central location (the parent). Point 4 above would be used to control where Reachability Alerts are delivered.
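To make the requested behaviour concrete, here is a minimal sketch of how a parent might evaluate one child against such a rule. It is only an illustration of the requirements listed above, not Netdata's implementation: `Silence`, `DeadmanRule`, `evaluate`, and the context names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass
class Silence:
    """Hypothetical daily recurring silence window (may wrap past midnight)."""
    start: time
    end: time

    def covers(self, now: datetime) -> bool:
        t = now.time()
        if self.start <= self.end:
            return self.start <= t < self.end
        return t >= self.start or t < self.end  # window wraps past midnight


@dataclass
class DeadmanRule:
    """Hypothetical per-child rule, held centrally on the parent."""
    delay: timedelta                            # grace period before alerting
    required_contexts: frozenset = frozenset()  # empty: only total loss matters
    silences: tuple = ()


def evaluate(rule: DeadmanRule, last_seen: dict, now: datetime) -> str:
    """Return 'silenced', 'ok', 'partial' or 'dead' for one child.

    last_seen maps a context name to the time the parent last received
    a sample for it from this child.
    """
    if any(s.covers(now) for s in rule.silences):
        return "silenced"
    # A context counts as fresh if it reported within the grace period.
    fresh = {ctx for ctx, ts in last_seen.items() if now - ts <= rule.delay}
    if not fresh:
        return "dead"
    if rule.required_contexts - fresh:
        return "partial"
    return "ok"


if __name__ == "__main__":
    rule = DeadmanRule(
        delay=timedelta(minutes=5),
        required_contexts=frozenset({"apache.requests"}),  # illustrative name
        silences=(Silence(time(22, 0), time(6, 0)),),      # off overnight
    )
    now = datetime(2024, 8, 29, 14, 30)
    seen = {
        "system.cpu": now - timedelta(seconds=10),
        "apache.requests": now - timedelta(minutes=20),
    }
    print(evaluate(rule, seen, now))
```

In this example the child is still streaming `system.cpu`, but the required Apache context has been stale for longer than the delay, so the rule reports partial loss rather than a dead node.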

Importance

must have

Value proposition

Deadman Alerts are a critical component of a monitoring solution and deserve first class status. Knowing that a child node is no longer reporting to the monitoring solution is as important as the monitoring itself. A child node that is not reporting in may:

Lose metrics.
Fail to trigger health alerts, especially when the parent is responsible for those health alerts.

If health alerts aren't triggered then the monitoring becomes effectively useless.

Because of this we believe it is absolutely essential to immediately detect if a non-ephemeral node is no longer reachable so that relevant teams can immediately investigate and rectify the issue. This is even more important in parent/child setups (the focus of this Feature Request) where the parent is responsible for health alerts.

Our aim is to avoid a situation where a node silently loses monitoring/alerting. We also want to be able to configure the parameters and notification options for such Deadman Alerts without having to rely on the limited Cloud Reachability function.

Proposed implementation

Netdata already detects the loss of communication from a child, or at least the reconnection of streaming. We see such output in the logs on the parent:

Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: thread created with task id 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: STREAM child.monitoring.com [localhost]:38164: receive thread started
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: [child.monitoring.com]: Postponing health checks for 60 seconds, because it was just connected.
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: STREAM child.monitoring.com [receive from [localhost]:38164]: established link with negotiated capabilities: VCAPS HLABELS CLAIM CLABELS FUNCTIONS REPLICATION BINARY INTERPOLATED IEEE754 ML DYNCFG SLOTS PROGRESS 
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: STREAM_RECEIVER for 'child.monitoring.com': connected and ready to receive data 
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: Queuing status update for node=REDACTED, live=1, hops=1, queryable=1
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'streaming' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'streaming' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'netdata-api-calls' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'netdata-api-calls' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'mount-points' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'mount-points' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'containers-vms' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'containers-vms' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'systemd-services' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'systemd-services' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'network-interfaces' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'network-interfaces' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'network-connections' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'network-connections' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'network-sockets-tracing' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'network-sockets-tracing' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'processes' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'processes' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'config' of host 'child.monitoring.com' changed collector from 3230785 to 3740625
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: FUNCTIONS: function 'config' of host 'child.monitoring.com' changed execute callback data
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: RRDCONTEXT: received checkpoint command for claim id 'REDACTED', node id 'REDACTED', while node 'child.monitoring.com' has an active context streaming.
Aug 29 14:30:59 parent.monitoring.com netdata[3229938]: RRDCONTEXT: host 'child.monitoring.com' enabling streaming of contexts

We propose that the Netdata Agent on the parent detects the loss of streaming metrics from a child in some appropriate manner, and that this detection is then used to trigger the Deadman Alert in the configured way. I believe that if you enable:

[plugins]                      
        netdata monitoring extended = yes

...then it's possible to view the condition of streaming, so it's possible there is existing scaffolding in place to build this feature.
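Until something like this exists in the agent, the connection events in the parent's log could be scraped externally. The sketch below is an assumption-laden illustration: the pattern is derived only from the `STREAM_RECEIVER` line in the log excerpt above, `last_connections` is a hypothetical helper, and because the log timestamps carry no year the caller has to supply one. Note that it only observes (re)connections, not the loss itself.

```python
import re
from datetime import datetime

# Pattern derived only from the sample log lines in this issue; other
# Netdata versions may word this message differently.
CONNECTED = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ netdata\[\d+\]: "
    r"STREAM_RECEIVER for '(?P<child>[^']+)': "
    r"connected and ready to receive data"
)


def last_connections(lines, year):
    """Map each child hostname to its most recent connection time."""
    seen = {}
    for line in lines:
        m = CONNECTED.match(line)
        if not m:
            continue  # not a connection event
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        child = m["child"]
        if child not in seen or ts > seen[child]:
            seen[child] = ts
    return seen
```

A child that reconnects repeatedly in a short period would show up here with a recent timestamp each time, which could feed an external flapping check. Detecting a child that never comes back still needs a timer on the parent side.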

If this Feature is implemented, #17539 could also be integrated.

ktsaou · 2025-02-06T17:51:47Z

Hey @dogsbody-josh I am working on this (better late than never - sorry for that).
Follow my progress on #19586 ...

dogsbody-josh added the 'feature request' and 'needs triage' labels on Aug 29, 2024

ilyam8 mentioned this issue on Sep 1, 2024 in "[Feat]: Add notification on child node going stale #17070" (Closed)

ktsaou mentioned this issue on Feb 6, 2025 in "Streaming alerts #19586" (Merged, 11 tasks)

ktsaou closed this as completed in #19586 on Feb 10, 2025
