-
Notifications
You must be signed in to change notification settings - Fork 874
fix(agent): ensure SSH server shutdown with process groups #17227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -0,0 +1,24 @@ | |||
//go:build !windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review: These are copy-pasta from agentscripts package.
|
||
func cmdCancel(ctx context.Context, logger slog.Logger, cmd *exec.Cmd) func() error { | ||
return func() error { | ||
logger.Debug(ctx, "cmdCancel: sending interrupt to process", slog.F("pid", cmd.Process.Pid)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
📋🍝 suggestion: check that cmd.Process != nil
as it could have exited.
agent/agentssh/agentssh.go
Outdated
s.logger.Debug(ctx, "closing server") | ||
|
||
// Stop accepting new connections. | ||
s.logger.Debug(ctx, "closing all active listeners") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: log the number of active listeners
agent/agentssh/agentssh.go
Outdated
// Close all active sessions to gracefully | ||
// terminate client connections. | ||
s.logger.Debug(ctx, "closing all active sessions") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: log the number of active sessions
agent/agentssh/agentssh.go
Outdated
for l := range s.listeners { | ||
_ = l.Close() | ||
} | ||
s.logger.Debug(ctx, "closing all active connections") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: log the number of active conns
agent/agent.go
Outdated
// stop accepting new ones. If all processes have not exited after | ||
// 10 seconds, we just log it and move on as it's more important | ||
// to run the shutdown scripts. | ||
sshShutdownCtx, sshShutdownCancel := context.WithTimeout(a.hardCtx, 10*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can foresee someone wanting to adjust this timeout. Maybe add a CLI option for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that if we do get requests to make this configurable, it will help drive the implementation. I'm not certain this approach is the end-goal we want and I would like to avoid exposing functionality that may limit future implementations.
When we start solving #6175, changing this behavior may be relevant.
I'll go ahead and reduce this to 5 seconds as well as the default docker grace timeout is 10s I believe.
The original motivation behind closing SSH before running shutdown scripts is to ensure that commands doing work exit before potentially doing backups of state or something along those lines. I think giving those commands a grace of 5 seconds is plenty in most cases.
agent/agentssh/agentssh.go
Outdated
@@ -1044,6 +1055,11 @@ func (s *Server) trackSession(ss ssh.Session, add bool) (ok bool) { | |||
// Close the server and all active connections. Server can be re-used | |||
// after Close is done. | |||
func (s *Server) Close() error { | |||
return s.close(context.Background()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this extra method with the context argument? I don't see it used elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally I intended to use s.close
in Shutdown
too, but I oopsied that change. I'll revert this though since ctx is only used for logger and we don't even utilize that feature often.
@@ -1082,15 +1103,36 @@ func (s *Server) Close() error { | |||
s.closing = nil | |||
s.mu.Unlock() | |||
|
|||
s.logger.Debug(ctx, "closing server done") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: log elapsed time for closing
After closer inspection of logs mentioned in #17108, it looks like the "connecting to coderd" loop was a red herring and the final log line from
(*agentssh.Server).Close
is missing.PTY's already create a process group, so they seem to have been unaffected. But non-PTY SSH sessions were able to prevent SSH server close from completing when child processes were left running. For good measure, I added a timeout mechanism just so we don't ever block on SSH server shutdown longer than 10s.
This can be verified on current
main
by starting a workspace that has a shutdown script, and running the following command:The stopping the workspace. Shutdown scripts won't run. With this change, they do.
Fixes #17108