Add support for systemd NOTIFY_SOCKET in agent core. #19326

Ferroin · 2025-01-06T14:39:07Z

Summary

This adds support for using systemd’s service notification functionality to inform the service manager that Netdata has successfully finished startup and that it is shutting down when told to do so.

This has a couple of advantages when running Netdata under systemd as a system service:

The service will only actually be marked as running successfully once the agent itself has finished startup (in contrast to the current arrangement, which marks it running as soon as systemd forks a process to start the agent). This, in turn, gives us a couple of specific benefits:
- systemctl start netdata will only return once the agent is actually running.
- systemctl start netdata will properly report an error if there is a startup issue in the agent.
- Any services that depend on Netdata will only be started after the agent finishes startup, ensuring that they can depend on the agent actually being running.
During shutdown, the service will be properly flagged as stopping.
Using netdatacli shutdown-agent will correctly report the shutdown to the service manager, instead of the agent just exiting unexpectedly.

Internally, this uses a stand-alone implementation of this functionality instead of relying on the one provided by libsystemd, as handling it this way enables us to support it even on static builds. The actual implementation is almost entirely copied from the MIT-0 licensed example implementation provided by systemd itself.

This is built by default on Linux systems, and should have near zero overhead for each notify call (one strlen() call, one getenv() call, and three trivial conditional checks) when not actually being used. It can be disabled at build time using -DENABLE_SYSTEMD_NOTIFY=Off. Support for other UNIX-like platforms is theoretically possible, but not particularly useful. Support for Windows would require a non-trivial restructure of some of the code, as it takes advantage of a feature that is not supported by MSVC to greatly simplify the code.

Test Plan

Basic testing involves building the agent from this PR and running it via systemd with the changes to the unit file outlined in this PR (change Type=simple to Type=notify and add NotifyAccess=exec in the [Service] section). If everything is working, the service will be marked as started slower than it normally would be, but will correctly reflect that the agent is in fact running.

Further testing can be done by intentionally inducing various startup failures in the agent and confirming that not only does the unit get never get marked started, but also the systemctl command reports an error.

Additional testing is required on non-systemd systems to confirm that the build still works there and nothing is broken.

Additional Information

Documentation of the protocol involved here can be found at https://www.freedesktop.org/software/systemd/man/latest/sd_notify.html (or in man 3 sd_notify on most systems using systemd). This is also the source of most of the implementation included in this PR (though since it’s MIT-0 licensed, we don’t have to provide any attribution and we can also license our copy as GPL-3 like the rest of Netdata).

Long-term, I would like to expand this to include reload notification handling, as well as more concrete status updates to the service manager (which can be more easily queried and presented by front-end tools for working with systemd).

In addition to the primary changes, this tidies up the clean exit handling so that it is consistent with most of the other signal handling code and doesn’t potentially call the exit cleanup code twice.

This adds support for using systemd’s service notification functionality to inform the service manager that Netdata has successfully finished startup and that it is shutting down when told to do so. The primary advantage of this for our usage is that it ensures that on systemd systems Netdata will only be shown as running once it has actually finished startup, instead of the current arrangement where it is considered running as soon as the service manager forks the process. This, in turn, gets a couple of other improvements: - `systemctl start netdata` will only return once the agent is actually running. - `systemctl start netdata` will properly report an error if there is a startup issue in the agent. - Any services that depend on Netdata will only be started _after_ the agent finishes startup, ensuring that they can depend on the agent actually being running. Internally, this uses a stand-alone implementation of this functionality instead of relying on the one provided by libsystemd, as handling it this way enables us to support it even on static builds. The actual implementation is almost entirely copied from the MIT-0 licensed example implementation provided by systemd itself.

Specifically, call out to the exit command _correctly_, and correctly assume that it exits and never returns.

Also, use timeout updates when appropriate, and move `notify_ready()` call to where Netdata itself considers itself ready.

thiagoftsm · 2025-01-16T12:36:28Z

src/daemon/daemon-shutdown.c

@@ -139,6 +143,11 @@ static void rrdeng_flush_everything_and_wait(bool wait_flush, bool wait_collecto
 void netdata_cleanup_and_exit(int ret, const char *action, const char *action_result, const char *action_data) {
    netdata_exit = 1;

+#ifdef ENABLE_SYSTEMD_NOTIFY
+    // FIXME: This should be calculated dynamically instead of hard-coded.


Hello @Ferroin ,

Do you plan to address this in a separate PR?

Best regards.

It needs to be addressed in a different PR, largely because I don’t have sufficient knowledge of the internals of the agent to understand how to calculate this.

@stelfrag, after conducting several tests, primarily on Arch Linux, I did not encounter any issues starting Netdata locally. Could you please confirm if the same applies in your environment?

thiagoftsm · 2025-01-16T12:37:43Z

src/daemon/main.c

@@ -719,6 +723,11 @@ int netdata_main(int argc, char **argv) {
        }
    }

+#ifdef ENABLE_SYSTEMD_NOTIFY
+    // FIXME: This should be computed dynamically and updated later on during startup


The same question mentioned above also applies here.

thiagoftsm · 2025-01-16T12:45:01Z

src/daemon/systemd-notify.c

+ */
+int notify_reloading(void) {
+  /* A buffer with length sufficient to format the maximum UINT64 value. */
+  char reload_message[sizeof("RELOADING=1\nMONOTONIC_USEC=18446744073709551615")];


In a normal situation in Netdata, we would add '+1' after sizeof to account for an additional character for the null terminator.

thiagoftsm · 2025-01-16T12:55:38Z

src/daemon/systemd-notify.c

+ * service timeout extension
+ */
+int notify_extend_timeout(uint64_t timeout) {
+  size_t msg_length = sizeof("EXTEND_TIMEOUT_USEC=18446744073709551615");


The same explanation provided for the previous function applies here.
I also noticed that you have two different algorithms for handling variables used as arguments to snprintf. I suggest standardizing and using only one of them.
Lastly, Netdata includes the snprintfz function. I recommend using it consistently in both functions.

thiagoftsm · 2025-01-16T12:56:06Z

src/daemon/systemd-notify.c

+  /* A buffer with length sufficient to format the maximum UINT64 value. */
+  char stop_message[msg_length];
+
+  snprintf(stop_message, msg_length, "STOPPING=1\nEXTEND_TIMEOUT_USEC=%lu", timeout);


The last comment, is also applied to this function.

thiagoftsm · 2025-01-16T12:57:24Z

src/daemon/systemd-notify.c

+ * These are used to indicate what step is happening during startup or shutdown.
+ */
+int notify_status(const char *message) {
+  size_t msg_length = strlen(message) + 7;


Please add one additional character to the length to account for the null terminator.

Ferroin added 3 commits January 6, 2025 08:30

Move shutdown notification to proper shutdown path.

1d1dab4

Fix up signal exit path.

9ad3700

Specifically, call out to the exit command _correctly_, and correctly assume that it exits and never returns.

github-actions bot added area/packaging Packaging and operating systems support area/daemon area/build Build system (autotools and cmake). labels Jan 6, 2025

Ferroin marked this pull request as ready for review January 6, 2025 16:17

Ferroin requested review from vkalintiris, a team and thiagoftsm as code owners January 6, 2025 16:17

Extend notification API to support status and timeout updates.

e26d75e

Also, use timeout updates when appropriate, and move `notify_ready()` call to where Netdata itself considers itself ready.

thiagoftsm reviewed Jan 16, 2025

View reviewed changes

thiagoftsm added 3 commits January 16, 2025 22:35

systemd-service-notify: Remove compilation warning

8a83c40

systemd-service-notify: Address comments

ed99c58

systemd-service-notify: Format code to netdata standard

8391919

thiagoftsm self-requested a review January 20, 2025 13:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add support for systemd NOTIFY_SOCKET in agent core. #19326

Add support for systemd NOTIFY_SOCKET in agent core. #19326

Uh oh!

Ferroin commented Jan 6, 2025 •

edited

Loading

Uh oh!

thiagoftsm Jan 16, 2025

Uh oh!

Ferroin Jan 16, 2025

Uh oh!

thiagoftsm Jan 17, 2025

Uh oh!

thiagoftsm Jan 16, 2025

Uh oh!

thiagoftsm Jan 16, 2025

Uh oh!

thiagoftsm Jan 16, 2025

Uh oh!

thiagoftsm Jan 16, 2025

Uh oh!

thiagoftsm Jan 16, 2025

Uh oh!

Uh oh!

Add support for systemd NOTIFY_SOCKET in agent core. #19326

Are you sure you want to change the base?

Add support for systemd NOTIFY_SOCKET in agent core. #19326

Uh oh!

Conversation

Ferroin commented Jan 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Plan

Additional Information

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Ferroin commented Jan 6, 2025 •

edited

Loading