-
Notifications
You must be signed in to change notification settings - Fork 6.1k
Add support for systemd NOTIFY_SOCKET in agent core. #19326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This adds support for using systemd’s service notification functionality to inform the service manager that Netdata has successfully finished startup and that it is shutting down when told to do so. The primary advantage of this for our usage is that it ensures that on systemd systems Netdata will only be shown as running once it has actually finished startup, instead of the current arrangement where it is considered running as soon as the service manager forks the process. This, in turn, gets a couple of other improvements: - `systemctl start netdata` will only return once the agent is actually running. - `systemctl start netdata` will properly report an error if there is a startup issue in the agent. - Any services that depend on Netdata will only be started _after_ the agent finishes startup, ensuring that they can depend on the agent actually being running. Internally, this uses a stand-alone implementation of this functionality instead of relying on the one provided by libsystemd, as handling it this way enables us to support it even on static builds. The actual implementation is almost entirely copied from the MIT-0 licensed example implementation provided by systemd itself.
Specifically, call out to the exit command _correctly_, and correctly assume that it exits and never returns.
Also, use timeout updates when appropriate, and move `notify_ready()` call to where Netdata itself considers itself ready.
@@ -139,6 +143,11 @@ static void rrdeng_flush_everything_and_wait(bool wait_flush, bool wait_collecto | |||
void netdata_cleanup_and_exit(int ret, const char *action, const char *action_result, const char *action_data) { | |||
netdata_exit = 1; | |||
|
|||
#ifdef ENABLE_SYSTEMD_NOTIFY | |||
// FIXME: This should be calculated dynamically instead of hard-coded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It needs to be addressed in a different PR, largely because I don’t have sufficient knowledge of the internals of the agent to understand how to calculate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stelfrag, after conducting several tests, primarily on Arch Linux, I did not encounter any issues starting Netdata locally. Could you please confirm if the same applies in your environment?
@@ -719,6 +723,11 @@ int netdata_main(int argc, char **argv) { | |||
} | |||
} | |||
|
|||
#ifdef ENABLE_SYSTEMD_NOTIFY | |||
// FIXME: This should be computed dynamically and updated later on during startup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same question mentioned above also applies here.
src/daemon/systemd-notify.c
Outdated
*/ | ||
int notify_reloading(void) { | ||
/* A buffer with length sufficient to format the maximum UINT64 value. */ | ||
char reload_message[sizeof("RELOADING=1\nMONOTONIC_USEC=18446744073709551615")]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a normal situation in Netdata, we would add '+1' after sizeof to account for an additional character for the null terminator.
src/daemon/systemd-notify.c
Outdated
* service timeout extension | ||
*/ | ||
int notify_extend_timeout(uint64_t timeout) { | ||
size_t msg_length = sizeof("EXTEND_TIMEOUT_USEC=18446744073709551615"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same explanation provided for the previous function applies here.
I also noticed that you have two different algorithms for handling variables used as arguments to snprintf
. I suggest standardizing and using only one of them.
Lastly, Netdata includes the snprintfz
function. I recommend using it consistently in both functions.
src/daemon/systemd-notify.c
Outdated
/* A buffer with length sufficient to format the maximum UINT64 value. */ | ||
char stop_message[msg_length]; | ||
|
||
snprintf(stop_message, msg_length, "STOPPING=1\nEXTEND_TIMEOUT_USEC=%lu", timeout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last comment, is also applied to this function.
src/daemon/systemd-notify.c
Outdated
* These are used to indicate what step is happening during startup or shutdown. | ||
*/ | ||
int notify_status(const char *message) { | ||
size_t msg_length = strlen(message) + 7; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add one additional character to the length to account for the null terminator.
Summary
This adds support for using systemd’s service notification functionality to inform the service manager that Netdata has successfully finished startup and that it is shutting down when told to do so.
This has a couple of advantages when running Netdata under systemd as a system service:
systemctl start netdata
will only return once the agent is actually running.systemctl start netdata
will properly report an error if there is a startup issue in the agent.netdatacli shutdown-agent
will correctly report the shutdown to the service manager, instead of the agent just exiting unexpectedly.Internally, this uses a stand-alone implementation of this functionality instead of relying on the one provided by libsystemd, as handling it this way enables us to support it even on static builds. The actual implementation is almost entirely copied from the MIT-0 licensed example implementation provided by systemd itself.
This is built by default on Linux systems, and should have near zero overhead for each notify call (one
strlen()
call, onegetenv()
call, and three trivial conditional checks) when not actually being used. It can be disabled at build time using-DENABLE_SYSTEMD_NOTIFY=Off
. Support for other UNIX-like platforms is theoretically possible, but not particularly useful. Support for Windows would require a non-trivial restructure of some of the code, as it takes advantage of a feature that is not supported by MSVC to greatly simplify the code.Test Plan
Basic testing involves building the agent from this PR and running it via systemd with the changes to the unit file outlined in this PR (change
Type=simple
toType=notify
and addNotifyAccess=exec
in the[Service]
section). If everything is working, the service will be marked as started slower than it normally would be, but will correctly reflect that the agent is in fact running.Further testing can be done by intentionally inducing various startup failures in the agent and confirming that not only does the unit get never get marked started, but also the systemctl command reports an error.
Additional testing is required on non-systemd systems to confirm that the build still works there and nothing is broken.
Additional Information
Documentation of the protocol involved here can be found at https://www.freedesktop.org/software/systemd/man/latest/sd_notify.html (or in
man 3 sd_notify
on most systems using systemd). This is also the source of most of the implementation included in this PR (though since it’s MIT-0 licensed, we don’t have to provide any attribution and we can also license our copy as GPL-3 like the rest of Netdata).Long-term, I would like to expand this to include reload notification handling, as well as more concrete status updates to the service manager (which can be more easily queried and presented by front-end tools for working with systemd).
In addition to the primary changes, this tidies up the clean exit handling so that it is consistent with most of the other signal handling code and doesn’t potentially call the exit cleanup code twice.