Skip to content

fix(agent): ensure SSH server shutdown with process groups #17227

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 3, 2025

Conversation

mafredri
Copy link
Member

@mafredri mafredri commented Apr 2, 2025

After closer inspection of logs mentioned in #17108, it looks like the "connecting to coderd" loop was a red herring and the final log line from (*agentssh.Server).Close is missing.

PTY's already create a process group, so they seem to have been unaffected. But non-PTY SSH sessions were able to prevent SSH server close from completing when child processes were left running. For good measure, I added a timeout mechanism just so we don't ever block on SSH server shutdown longer than 10s.

This can be verified on current main by starting a workspace that has a shutdown script, and running the following command:

ssh coder.testa "/bin/bash -c 'trap \"sleep 6000\" SIGTERM; sleep 6000'"

The stopping the workspace. Shutdown scripts won't run. With this change, they do.

Fixes #17108

@@ -0,0 +1,24 @@
//go:build !windows
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: These are copy-pasta from agentscripts package.

@mafredri mafredri marked this pull request as ready for review April 2, 2025 15:54

func cmdCancel(ctx context.Context, logger slog.Logger, cmd *exec.Cmd) func() error {
return func() error {
logger.Debug(ctx, "cmdCancel: sending interrupt to process", slog.F("pid", cmd.Process.Pid))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📋🍝 suggestion: check that cmd.Process != nil as it could have exited.

s.logger.Debug(ctx, "closing server")

// Stop accepting new connections.
s.logger.Debug(ctx, "closing all active listeners")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: log the number of active listeners

// Close all active sessions to gracefully
// terminate client connections.
s.logger.Debug(ctx, "closing all active sessions")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: log the number of active sessions

for l := range s.listeners {
_ = l.Close()
}
s.logger.Debug(ctx, "closing all active connections")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: log the number of active conns

agent/agent.go Outdated
// stop accepting new ones. If all processes have not exited after
// 10 seconds, we just log it and move on as it's more important
// to run the shutdown scripts.
sshShutdownCtx, sshShutdownCancel := context.WithTimeout(a.hardCtx, 10*time.Second)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can foresee someone wanting to adjust this timeout. Maybe add a CLI option for it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if we do get requests to make this configurable, it will help drive the implementation. I'm not certain this approach is the end-goal we want and I would like to avoid exposing functionality that may limit future implementations.

When we start solving #6175, changing this behavior may be relevant.

I'll go ahead and reduce this to 5 seconds as well as the default docker grace timeout is 10s I believe.

The original motivation behind closing SSH before running shutdown scripts is to ensure that commands doing work exit before potentially doing backups of state or something along those lines. I think giving those commands a grace of 5 seconds is plenty in most cases.

@@ -1044,6 +1055,11 @@ func (s *Server) trackSession(ss ssh.Session, add bool) (ok bool) {
// Close the server and all active connections. Server can be re-used
// after Close is done.
func (s *Server) Close() error {
return s.close(context.Background())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this extra method with the context argument? I don't see it used elsewhere.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally I intended to use s.close in Shutdown too, but I oopsied that change. I'll revert this though since ctx is only used for logger and we don't even utilize that feature often.

@@ -1082,15 +1103,36 @@ func (s *Server) Close() error {
s.closing = nil
s.mu.Unlock()

s.logger.Debug(ctx, "closing server done")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: log elapsed time for closing

@mafredri mafredri merged commit b61f0ab into main Apr 3, 2025
34 checks passed
@mafredri mafredri deleted the mafredri/fix-agent-agentssh-close-block branch April 3, 2025 13:01
@github-actions github-actions bot locked and limited conversation to collaborators Apr 3, 2025
@mafredri mafredri changed the title fix(agent): improve SSH server shutdown and continue after timeout fix(agent): ensure SSH server shutdown with process groups Apr 3, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

investigate why run_on_stop doesn't work consistently
2 participants