feat: Add support for shutdown_script #5171
Conversation
Code looks good and seems like it should work, but there are a few things we still need to figure out.

1. There is no guarantee that the shutdown script actually runs (this may be Terraform provider dependent)
2. The test for shutdown works, but in practice it will not run (see below)
Regarding 1), we have no control over how a Terraform provider will tear down its resources. Seen from the point of view of the agent, it could be receiving `kill -[SIG] [pid]` where the signal could be any of `HUP`, `INT`, `TERM`, `KILL`, etc. It'll be impossible to recover from a `KILL`, for instance, so the shutdown script can't run. Maybe there's also a chance that the agent will receive no signal at all; it'll just disappear.
To give a guarantee, we could initiate shutdown by communicating between server and agent that it's time for the agent to shut down. The agent would gracefully shut down all the services it needs to and message back to the server that everything went a-OK (or failed).

- a-OK => provision down
- error => surface the error, halt shutdown

That's one way I envision this could play out. There may be knobs for other scenarios, if needed. A force shutdown would not require the a-OK.
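Something along these lines, purely as a sketch: the request/response types and `reportShutdown` are made up for illustration, only `runShutdownScript` corresponds to what this PR adds.

```go
package agent

import (
	"context"
	"fmt"
	"os/exec"
)

// Hypothetical handshake messages; none of this exists in coderd/agent today.
type shutdownRequest struct {
	ShutdownScript string
}

type shutdownResult struct {
	OK    bool
	Error string
}

// reporter stands in for whatever client the agent uses to talk to the server.
type reporter interface {
	reportShutdown(ctx context.Context, res shutdownResult) error
}

type agent struct{}

// runShutdownScript mirrors what this PR adds: run the script via a shell.
func (a *agent) runShutdownScript(ctx context.Context, script string) error {
	return exec.CommandContext(ctx, "/bin/sh", "-c", script).Run()
}

// handleShutdown is the server-initiated path: run the shutdown script and
// report a-OK (or an error) back, so the server can decide whether to proceed
// with "provision down" or surface the failure and halt.
func (a *agent) handleShutdown(ctx context.Context, client reporter, req shutdownRequest) error {
	if err := a.runShutdownScript(ctx, req.ShutdownScript); err != nil {
		_ = client.reportShutdown(ctx, shutdownResult{OK: false, Error: err.Error()})
		return fmt.Errorf("shutdown script: %w", err)
	}
	return client.reportShutdown(ctx, shutdownResult{OK: true})
}
```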
Regarding 2), I noticed we don't have proper signal handling for the agent. So even if `kill -INT [pid]` is sent, nothing will happen and the shutdown script won't run. A `kill -TERM [pid]` will immediately kill the process, also not triggering the shutdown script.
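To illustrate the kind of handling I mean (a minimal standalone sketch, not the agent's actual code): trap INT/TERM and run a shutdown command before exiting.

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"os/signal"
	"syscall"
)

func main() {
	// Run until SIGINT or SIGTERM is received. SIGKILL can never be caught,
	// so the shutdown script cannot be guaranteed to run in that case.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	// ...the agent does its normal work here...

	<-ctx.Done()

	// ctx is already canceled by the signal, so use a fresh context for the script.
	cmd := exec.CommandContext(context.Background(), "/bin/sh", "-c", "echo running shutdown script")
	if err := cmd.Run(); err != nil {
		log.Printf("shutdown script failed: %v", err)
	}
}
```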
I've been using the following terraform template to test it out:
`main.tf`:

```tf
terraform {
  required_providers {
    coder = {
      source  = "coder/coder"
      version = "0.6.5"
    }
  }
}

data "coder_provisioner" "me" {
}

data "coder_workspace" "me" {
}

resource "coder_agent" "main" {
  arch               = data.coder_provisioner.me.arch
  os                 = "linux"
  connection_timeout = 15
  motd_file          = "/etc/motd"
  startup_script     = <<-EOS
    echo hi
  EOS
  shutdown_script    = <<-EOS
    set -e
    echo starting shutdown...
    sleep 10
    echo bye
  EOS
}

resource "null_resource" "workspace" {
  count = data.coder_workspace.me.start_count

  provisioner "local-exec" {
    when    = create
    command = <<-EOS
      set -e
      dir=/tmp/coder-agent-${self.id}
      mkdir -p "$dir"
      cd ~/src/coder/coder
      go build -o "$dir"/coder ./enterprise/cmd/coder
      cd "$dir"
      export CODER_AGENT_TOKEN=${coder_agent.main.token}
      export CODER_AGENT_AUTH=token
      export CODER_AGENT_URL=http://localhost:3000
      export CODER_AGENT_ID=${self.id}
      nohup ./coder agent >/dev/null 2>&1 &
      echo $! >"$dir"/coder-agent.pid
    EOS
    interpreter = ["/bin/bash", "-c"]
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<-EOS
      set -e
      pid=/tmp/coder-agent-${self.id}/coder-agent.pid
      if pgrep -F "$pid" >/dev/null 2>&1; then
        pkill -F "$pid"
        while pgrep -F "$pid" >/dev/null 2>&1; do
          sleep 1
        done
      fi
    EOS
    interpreter = ["/bin/bash", "-c"]
  }
}
```
```go
metadata, valid := rawMetadata.(codersdk.WorkspaceAgentMetadata)
if !valid {
	return xerrors.Errorf("metadata is the wrong type: %T", metadata)
}
```
We probably shouldn't do an early exit here and leave closure half-way after it's already started.
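For example (a sketch only, mirroring the quoted lines above), the type check could log and skip the script instead of returning, so the rest of the close still runs:

```go
// Sketch: instead of aborting the close, log and skip the shutdown script.
metadata, valid := rawMetadata.(codersdk.WorkspaceAgentMetadata)
if !valid {
	a.logger.Error(ctx, "metadata is the wrong type, skipping shutdown script",
		slog.F("type", fmt.Sprintf("%T", rawMetadata)))
} else {
	// ...run the shutdown script, then continue with the rest of Close().
}
```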
```go
err := a.runShutdownScript(ctx, metadata.ShutdownScript)
if err != nil {
	a.logger.Error(ctx, "shutdown script failed", slog.Error(err))
}
```
We might want to do this earlier. And also, we might not want to continue on error. Critical workspace tasks could be performed at this step, like syncing filesystem to cold storage or something similar.
> And also, we might not want to continue on error. Critical workspace tasks could be performed at this step, like syncing filesystem to cold storage or something similar.
I assumed the opposite condition. I wouldn't like to end up with 100 workspaces I can't kill. I suppose it depends on which option is less evil. I'm fine with changing the approach to align with your suggestion.
It’s a valid thought. Another way to look at it is: if it’s allowed to fail, why have it at all? Both views are of course valid. In the PR review comment I mentioned that a force stop would allow shutting down such workspaces. That’s one option; alternatively it could be a provider option, but we should be careful with adding those until there are real use-cases.
> In the PR review comment I mentioned a force stop would allow shutting down such workspaces
I might have missed this comment, but do you mean having the option to configure the forced stop somewhere in metadata, which could be changed via the UI or some other place?
I was hinting that `coder stop` would receive a new flag, e.g. `--force`. Similarly the WebUI might receive a new button. This would (probably) be cli/server-only and not wait for the a-OK from the agent; instead it would perform the terraform teardown immediately.
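Roughly like this on the CLI side, as a hypothetical sketch only; it assumes a cobra-style command, and the flag, wiring, and behavior are all made up for illustration:

```go
package cli

import "github.com/spf13/cobra"

// Hypothetical: a --force flag on `coder stop`. With --force the server would
// not wait for the agent's a-OK before starting the terraform teardown.
func stopCommand() *cobra.Command {
	var force bool
	cmd := &cobra.Command{
		Use:   "stop <workspace>",
		Short: "Stop a workspace",
		RunE: func(cmd *cobra.Command, args []string) error {
			// In the forced case, skip the graceful agent handshake entirely
			// and kick off the stop build immediately (not implemented here).
			_ = force
			return nil
		},
	}
	cmd.Flags().BoolVar(&force, "force", false,
		"Do not wait for the agent to finish its shutdown script")
	return cmd
}
```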
```diff
@@ -336,6 +336,7 @@ practices:
 - Manually connect to the resource and check the agent logs (e.g., `docker exec` or AWS console)
 - The Coder agent logs are typically stored in `/var/log/coder-agent.log`
 - The Coder agent startup script logs are typically stored in `/var/log/coder-startup-script.log`
+- The Coder agent shutdown script logs are typically stored in `/var/log/coder-shutdown-script.log`
```
This (and previous) documentation seems wrong. We currently always store in `/tmp` (or rather, the OS temp dir), even if that isn't ideal.
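In other words, the path the agent actually ends up writing to looks roughly like this (a sketch, assuming the log file name is simply joined onto the OS temp dir):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// On Linux this is typically /tmp/coder-shutdown-script.log,
	// not /var/log/coder-shutdown-script.log as the docs suggest.
	fmt.Println(filepath.Join(os.TempDir(), "coder-shutdown-script.log"))
}
```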
Thank you for providing the .tf file, @mafredri. I will double-check the signal handler; I expect it to execute.

I'd like to see some numbers or bug reports, so that we know if it's a common issue.

I admit that I would rather correct the signal handler than extend the communication flow. In this case, it would be synchronized and would require interaction with the agent coordinator (?). Also, if we decide to go this way, I think it is valid to keep it the same way for the startup script.
The best approach is to look at what the common terraform providers do (Docker behavior detailed below).

I agree with correcting the signal handler, but it may not be enough. The docker provider, for instance, does send SIGTERM, which the agent can handle gracefully. However, Docker also has a grace period (usually 10s) after which SIGKILL is sent, and that will interrupt the shutdown script. But let's say we document this and instruct users to set a long enough grace period.

There's also the use case of showing startup script logs in build logs for the WebUI (#2957). If we want to ensure shutdown logs are also sent to the server, we need better handling. For instance, currently when we do a stop, the terraform teardown starts right away. This would prevent us from sending the shutdown logs as well, which suggests we need some form of intermediary state between stop and running terraform.
Considering these requirements, I agree that we need a custom mechanism to control the shutdown procedure, no matter which provider is in play at that moment. I will evaluate your suggestion about messaging the a-OK/error state back to the server. If it isn't super complex to implement, we can include it here. Thanks for the feedback!
FE ✅
Frontend ✅
This Pull Request is becoming stale. In order to minimize WIP, prevent merge conflicts and keep the tracker readable, I'm going to close this PR in 3 days if there isn't more activity.
@Kira-Pilot @presleyp Why is this approved but not merged? The pull request on the Terraform side has been merged, and I understand this cannot work without the corresponding implementation here?
There are some doubts about the stability of this solution. We discussed alternative options here. There is no decision yet on which approach we'd like to follow.
@mafredri Awesome, thanks! I will test when the next version is released 👍
Issue: #4677
Related: coder/terraform-provider-coder#79
This PR adds basic support for the shutdown script. The script is executed when the agent is about to close, just before networking is shut down.
In the follow-up PR, I will try to add a timeout.
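Something roughly like this, as a sketch only; the wrapper name and the 5-minute value are placeholders, not decided behavior, and `runShutdownScript` is the function added in this PR:

```go
// Sketch of the planned follow-up: bound the shutdown script with a timeout
// so a hanging script cannot block the agent from closing indefinitely.
func (a *agent) runShutdownScriptWithTimeout(ctx context.Context, script string) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()
	return a.runShutdownScript(ctx, script)
}
```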