
feat: Add support for shutdown_script #5171


Closed
mtojek wants to merge 9 commits

Conversation

mtojek
Member

@mtojek mtojek commented Nov 25, 2022

Issue: #4677
Related: coder/terraform-provider-coder#79

This PR adds basic support for the shutdown script. The script is executed when the agent is about to be closed, just before the networking is shut down.

In a follow-up PR, I will try to add a timeout.

@mtojek mtojek self-assigned this Nov 25, 2022
@mtojek mtojek changed the title from "Add support for shutdown_script" to "feat: Add support for shutdown_script" Nov 25, 2022
@mtojek mtojek requested review from a team and mafredri November 25, 2022 19:06
@mtojek mtojek marked this pull request as ready for review November 25, 2022 19:12
@mtojek mtojek requested a review from a team as a code owner November 25, 2022 19:12
@mtojek mtojek requested review from presleyp and removed request for a team November 25, 2022 19:12
Member

@mafredri mafredri left a comment


Code looks good and seems like it should work, but there are a few things we still need to figure out.

  1. There is no guarantee that the shutdown script actually runs (this may be Terraform provider dependent)
  2. The test for shutdown works, but in practice it will not run (see below)

Regarding 1), we have no control over the way a Terraform provider will tear down its resources. Seeing this from the point of view of the agent, it could be receiving kill -[SIG] [pid] where the signal could be any of HUP, INT, TERM, KILL, etc. It'll be impossible to recover from a KILL, for instance, so the shutdown script can't run. Maybe there's also a chance that the agent will receive no signal at all; it'll just disappear.

To give a guarantee, we could initiate shutdown by communicating between server and agent that it's time for the agent to shut down. The agent would gracefully shut down all the services it needs to and message back to the server that everything went a-OK (or failed).

a-OK => provision down
error => surface the error, halt shutdown

That's one way I envision this could play out. There may be knobs for other scenarios, if needed. A force shutdown would not require the a-OK.

Regarding 2), I noticed we don't have proper signal handling for the agent. So even if we send kill -INT [pid], nothing happens and the shutdown script won't run. A kill -TERM [pid] will immediately kill the process, also not triggering shutdown.
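
For illustration, here is a minimal sketch of what trapping INT/TERM could look like, so that Close (and with it the shutdown script) gets a chance to run. This is not the actual agent code; agentCloser is a stand-in for the real agent:

package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
)

// agentCloser stands in for the real agent; Close is where the shutdown
// script would run, just before networking is torn down.
type agentCloser struct{}

func (agentCloser) Close() error {
	log.Println("running shutdown script...")
	return nil
}

func main() {
	// Cancel the context on SIGINT or SIGTERM instead of letting the runtime
	// kill the process immediately, so the agent can be closed gracefully.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	agent := agentCloser{}

	<-ctx.Done() // wait for a signal
	if err := agent.Close(); err != nil {
		log.Printf("close agent: %v", err)
	}
}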

I've been using the following terraform template to test it out:

`main.tf`:
terraform {
  required_providers {
    coder = {
      source  = "coder/coder"
      version = "0.6.5"
    }
  }
}

data "coder_provisioner" "me" {
}

data "coder_workspace" "me" {
}

resource "coder_agent" "main" {
  arch               = data.coder_provisioner.me.arch
  os                 = "linux"
  connection_timeout = 15
  motd_file          = "/etc/motd"
  startup_script     = <<-EOS
    echo hi
  EOS
  shutdown_script    = <<-EOS
    set -e
    echo starting shutdown...
    sleep 10
    echo bye
  EOS
}

resource "null_resource" "workspace" {
  count = data.coder_workspace.me.start_count

  provisioner "local-exec" {
    when        = create
    command     = <<-EOS
      set -e
      dir=/tmp/coder-agent-${self.id}
      mkdir -p "$dir"
      cd ~/src/coder/coder
      go build -o "$dir"/coder ./enterprise/cmd/coder

      cd "$dir"
      export CODER_AGENT_TOKEN=${coder_agent.main.token}
      export CODER_AGENT_AUTH=token
      export CODER_AGENT_URL=http://localhost:3000
      export CODER_AGENT_ID=${self.id}

      nohup ./coder agent >/dev/null 2>&1 &
      echo $! >"$dir"/coder-agent.pid
    EOS
    interpreter = ["/bin/bash", "-c"]
  }
  provisioner "local-exec" {
    when        = destroy
    command     = <<-EOS
      set -e
      pid=/tmp/coder-agent-${self.id}/coder-agent.pid
      if pgrep -F "$pid" >/dev/null 2>&1; then
        pkill -F "$pid"
        while pgrep -F "$pid" >/dev/null 2>&1; do
          sleep 1
        done
      fi
    EOS
    interpreter = ["/bin/bash", "-c"]
  }
}

metadata, valid := rawMetadata.(codersdk.WorkspaceAgentMetadata)
if !valid {
	return xerrors.Errorf("metadata is the wrong type: %T", metadata)
}
Member


We probably shouldn't do an early exit here and leave the close half-way done after it's already started.

err := a.runShutdownScript(ctx, metadata.ShutdownScript)
if err != nil {
	a.logger.Error(ctx, "shutdown script failed", slog.Error(err))
}
Member


We might want to do this earlier. And also, we might not want to continue on error. Critical workspace tasks could be performed at this step, like syncing filesystem to cold storage or something similar.
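
As a sketch of the "halt on error" alternative (reusing the names from the hunk above; this is not a change made in this PR):

err := a.runShutdownScript(ctx, metadata.ShutdownScript)
if err != nil {
	a.logger.Error(ctx, "shutdown script failed", slog.Error(err))
	// Propagate the error so the close halts here instead of continuing.
	return xerrors.Errorf("run shutdown script: %w", err)
}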

Member Author


And also, we might not want to continue on error. Critical workspace tasks could be performed at this step, like syncing filesystem to cold storage or something similar.

I assumed the opposite. I wouldn't like to end up with 100 workspaces I can't kill. I suppose it depends on which option is the lesser evil. I'm fine with changing the approach to align with your suggestion.

Member


It’s a valid thought. Another way to look at it is: if it’s allowed to fail, why have it at all? Both are of course valid. In the PR review comment I mentioned that a force stop would allow shutting down such workspaces. That’s one option; alternatively, it could be a provider option, but we should be careful about adding those until there are real use-cases.

Member Author


In the PR review comment I mentioned a force stop would allow shutting down such workspaces

I might have missed this comment, but do you mean having an option to configure the forced stop somewhere in metadata, which could be changed via the UI or somewhere else?

Member

@mafredri mafredri Nov 28, 2022


I was hinting that coder stop would receive a new flag, e.g. --force. Similarly, the WebUI might receive a new button. This would (probably) be CLI/server-only and not wait for the a-OK from the agent; instead, it would perform the terraform teardown immediately.

@@ -336,6 +336,7 @@ practices:
- Manually connect to the resource and check the agent logs (e.g., `docker exec` or AWS console)
- The Coder agent logs are typically stored in `/var/log/coder-agent.log`
- The Coder agent startup script logs are typically stored in `/var/log/coder-startup-script.log`
- The Coder agent shutdown script logs are typically stored in `/var/log/coder-shutdown-script.log`
Member


This (and the previous) documentation seems wrong. We currently always store these in /tmp (or rather, the OS temp dir), even if that isn't ideal.

@mtojek
Member Author

mtojek commented Nov 28, 2022

Regarding 2), I noticed we don't have proper signal handling for the agent. So even if we send kill -INT [pid], nothing happens and the shutdown script won't run. A kill -TERM [pid] will immediately kill the process, also not triggering shutdown.

Thank you for providing the .tf file, @mafredri. I will double-check the signal handler. I expect it to execute agent.Close() consistently, so in this case the shutdown script is not the only thing affected (networking, for instance). It looks like a bug to me; I'm not sure yet if it's easy to fix.

Maybe there's also a chance that the agent will receive no signal at all; it'll just disappear.

I'd like to see some numbers or bug reports, so that we know if it's a common issue.

To give a guarantee, we could initiate shutdown by communicating between server and agent that it's time for the agent to shut down. The agent would gracefully shut down all the services it needs to and message back to the server that everything went a-OK (or failed).

I admit that I would rather correct the signal handler than extend the communication flow. In this case, it would be synchronized and would require interaction with the agent coordinator (?). Also, if we decide to go this way, I think the startup script should be handled the same way.

@mafredri
Member

I'd like to see some numbers or bug reports, so that we know if it's a common issue.

The best approach is to look at what the common Terraform providers do (Docker behavior detailed below).

I admit that I would rather correct the signal handler than extend the communication flow. In this case, it would be synchronized and would require interaction with the agent coordinator (?). Also, if we decide to go this way, I think the startup script should be handled the same way.

I agree with correcting the signal handler, but it may not be enough. The Docker provider, for instance, does send SIGTERM, which the agent can handle gracefully. However, Docker also has a grace period (usually 10s), after which SIGKILL will be sent. This will interrupt the shutdown script.
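
For illustration, the grace period can be lengthened per container with something like the snippet below. This is a sketch only: it assumes the kreuzwerker/docker provider's docker_container resource and its stop_timeout argument (in seconds), and the resource name and image are illustrative:

resource "docker_container" "workspace" {
  count = data.coder_workspace.me.start_count
  name  = "coder-${data.coder_workspace.me.name}"
  image = "ubuntu:22.04"

  # Lengthen Docker's grace period (default ~10s) so the agent has time to
  # handle SIGTERM and finish the shutdown script before SIGKILL arrives.
  stop_timeout = 60
}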

But let's say we document this, and instruct users to set stop_timeout, etc (assuming it helps). There are still hundreds of providers out there that behave in their own way.

There's also the use case of showing startup script logs in build logs for the WebUI #2957. If we want to ensure shutdown logs are also sent to the server, we need better handling.

For instance, currently when we do coder stop workspace, the agent will almost immediately be marked as outdated:

GET http://localhost:3000/api/v2/workspaceagents/me/coordinate: unexpected status code 403: Agent trying to connect from non-latest build.

This would prevent us from sending the shutdown logs as well, which suggests we need some form of intermediary state between stop and running terraform.

@mtojek
Member Author

mtojek commented Nov 28, 2022

But let's say we document this, and instruct users to set stop_timeout, etc (assuming it helps). There are still hundreds of providers out there that behave in their own way.

This would prevent us from sending the shutdown logs as well, which suggests we need some form of intermediary state between stop and running terraform.

Considering these requirements, I agree that we need a custom mechanism to control the shutdown procedure, no matter which provider is in play. I will evaluate your suggestion about messaging the a-OK/error state back to the server. If it isn't super complex to implement, we can include it here.

Thanks for the feedback!

@mtojek mtojek marked this pull request as draft November 28, 2022 12:21
Member

@Kira-Pilot Kira-Pilot left a comment


FE ✅

Contributor

@presleyp presleyp left a comment


Frontend ✅

@github-actions

github-actions bot commented Dec 8, 2022

This Pull Request is becoming stale. In order to minimize WIP, prevent merge conflicts, and keep the tracker readable, I'm going to close this PR in 3 days if there isn't more activity.

@github-actions github-actions bot added the stale This issue is like stale bread. label Dec 8, 2022
@github-actions github-actions bot closed this Dec 11, 2022
@zifeo

zifeo commented Jan 12, 2023

@Kira-Pilot @presleyp Why is this approved but not merged? The pull request on the Terraform side has been merged, and I understand this cannot work without the corresponding implementation?

@presleyp
Contributor

@zifeo we were just approving the frontend changes, and then waiting for backend approval. In the meantime, the PR got stale and was closed automatically. @mtojek, what did you decide to do about this?

@mtojek
Member Author

mtojek commented Jan 12, 2023

There are some doubts about the stability of this solution. We discussed alternative options here. There is no decision yet on which approach we'd like to follow.

@zifeo

zifeo commented Jan 12, 2023

@presleyp @mtojek OK, the proposal seems great. In the meantime, maybe the one on the provider side should be reverted?

@mafredri
Member

mafredri commented Mar 6, 2023

@zifeo A PR that supersedes this one (#6139) was just merged, in case you want to check it out.

@zifeo

zifeo commented Mar 6, 2023

@mafredri Awesome, thanks! I will test when the next version is released 👍

Labels
stale This issue is like stale bread.
5 participants