Skip to content

Add support for systemd NOTIFY_SOCKET in agent core. #19326

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

Ferroin
Copy link
Member

@Ferroin Ferroin commented Jan 6, 2025

Summary

This adds support for using systemd’s service notification functionality to inform the service manager that Netdata has successfully finished startup and that it is shutting down when told to do so.

This has a couple of advantages when running Netdata under systemd as a system service:

  • The service will only actually be marked as running successfully once the agent itself has finished startup (in contrast to the current arrangement, which marks it running as soon as systemd forks a process to start the agent). This, in turn, gives us a couple of specific benefits:
    • systemctl start netdata will only return once the agent is actually running.
    • systemctl start netdata will properly report an error if there is a startup issue in the agent.
    • Any services that depend on Netdata will only be started after the agent finishes startup, ensuring that they can depend on the agent actually being running.
  • During shutdown, the service will be properly flagged as stopping.
  • Using netdatacli shutdown-agent will correctly report the shutdown to the service manager, instead of the agent just exiting unexpectedly.

Internally, this uses a stand-alone implementation of this functionality instead of relying on the one provided by libsystemd, as handling it this way enables us to support it even on static builds. The actual implementation is almost entirely copied from the MIT-0 licensed example implementation provided by systemd itself.

This is built by default on Linux systems, and should have near zero overhead for each notify call (one strlen() call, one getenv() call, and three trivial conditional checks) when not actually being used. It can be disabled at build time using -DENABLE_SYSTEMD_NOTIFY=Off. Support for other UNIX-like platforms is theoretically possible, but not particularly useful. Support for Windows would require a non-trivial restructure of some of the code, as it takes advantage of a feature that is not supported by MSVC to greatly simplify the code.

Test Plan

Basic testing involves building the agent from this PR and running it via systemd with the changes to the unit file outlined in this PR (change Type=simple to Type=notify and add NotifyAccess=exec in the [Service] section). If everything is working, the service will be marked as started slower than it normally would be, but will correctly reflect that the agent is in fact running.

Further testing can be done by intentionally inducing various startup failures in the agent and confirming that not only does the unit get never get marked started, but also the systemctl command reports an error.

Additional testing is required on non-systemd systems to confirm that the build still works there and nothing is broken.

Additional Information

Documentation of the protocol involved here can be found at https://www.freedesktop.org/software/systemd/man/latest/sd_notify.html (or in man 3 sd_notify on most systems using systemd). This is also the source of most of the implementation included in this PR (though since it’s MIT-0 licensed, we don’t have to provide any attribution and we can also license our copy as GPL-3 like the rest of Netdata).

Long-term, I would like to expand this to include reload notification handling, as well as more concrete status updates to the service manager (which can be more easily queried and presented by front-end tools for working with systemd).

In addition to the primary changes, this tidies up the clean exit handling so that it is consistent with most of the other signal handling code and doesn’t potentially call the exit cleanup code twice.

This adds support for using systemd’s service notification functionality
to inform the service manager that Netdata has successfully finished
startup and that it is shutting down when told to do so.

The primary advantage of this for our usage is that it ensures that on
systemd systems Netdata will only be shown as running once it has
actually finished startup, instead of the current arrangement where it
is considered running as soon as the service manager forks the process.

This, in turn, gets a couple of other improvements:

- `systemctl start netdata` will only return once the agent is actually
  running.
- `systemctl start netdata` will properly report an error if there is a
  startup issue in the agent.
- Any services that depend on Netdata will only be started _after_ the
  agent finishes startup, ensuring that they can depend on the agent
  actually being running.

Internally, this uses a stand-alone implementation of this functionality
instead of relying on the one provided by libsystemd, as handling it
this way enables us to support it even on static builds. The actual
implementation is almost entirely copied from the MIT-0 licensed example
implementation provided by systemd itself.
Specifically, call out to the exit command _correctly_, and correctly
assume that it exits and never returns.
@github-actions github-actions bot added area/packaging Packaging and operating systems support area/daemon area/build Build system (autotools and cmake). labels Jan 6, 2025
@Ferroin Ferroin marked this pull request as ready for review January 6, 2025 16:17
@Ferroin Ferroin requested review from vkalintiris, a team and thiagoftsm as code owners January 6, 2025 16:17
Also, use timeout updates when appropriate, and move `notify_ready()`
call to where Netdata itself considers itself ready.
@@ -139,6 +143,11 @@ static void rrdeng_flush_everything_and_wait(bool wait_flush, bool wait_collecto
void netdata_cleanup_and_exit(int ret, const char *action, const char *action_result, const char *action_data) {
netdata_exit = 1;

#ifdef ENABLE_SYSTEMD_NOTIFY
// FIXME: This should be calculated dynamically instead of hard-coded.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @Ferroin ,

Do you plan to address this in a separate PR?

Best regards.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to be addressed in a different PR, largely because I don’t have sufficient knowledge of the internals of the agent to understand how to calculate this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stelfrag, after conducting several tests, primarily on Arch Linux, I did not encounter any issues starting Netdata locally. Could you please confirm if the same applies in your environment?

@@ -719,6 +723,11 @@ int netdata_main(int argc, char **argv) {
}
}

#ifdef ENABLE_SYSTEMD_NOTIFY
// FIXME: This should be computed dynamically and updated later on during startup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same question mentioned above also applies here.

*/
int notify_reloading(void) {
/* A buffer with length sufficient to format the maximum UINT64 value. */
char reload_message[sizeof("RELOADING=1\nMONOTONIC_USEC=18446744073709551615")];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a normal situation in Netdata, we would add '+1' after sizeof to account for an additional character for the null terminator.

* service timeout extension
*/
int notify_extend_timeout(uint64_t timeout) {
size_t msg_length = sizeof("EXTEND_TIMEOUT_USEC=18446744073709551615");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same explanation provided for the previous function applies here.
I also noticed that you have two different algorithms for handling variables used as arguments to snprintf. I suggest standardizing and using only one of them.
Lastly, Netdata includes the snprintfz function. I recommend using it consistently in both functions.

/* A buffer with length sufficient to format the maximum UINT64 value. */
char stop_message[msg_length];

snprintf(stop_message, msg_length, "STOPPING=1\nEXTEND_TIMEOUT_USEC=%lu", timeout);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last comment, is also applied to this function.

* These are used to indicate what step is happening during startup or shutdown.
*/
int notify_status(const char *message) {
size_t msg_length = strlen(message) + 7;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add one additional character to the length to account for the null terminator.

@thiagoftsm thiagoftsm self-requested a review January 20, 2025 13:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/build Build system (autotools and cmake). area/daemon area/packaging Packaging and operating systems support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants