
feat: Add support for shutdown_script #5171


Closed
mtojek wants to merge 9 commits

Conversation

mtojek
Member

@mtojek mtojek commented Nov 25, 2022

Issue: #4677
Related: coder/terraform-provider-coder#79

This PR adds basic support for the shutdown script. The script is executed when the agent is about to be closed, just before the networking is shut down.

In a follow-up PR, I will try to add a timeout.

@mtojek mtojek self-assigned this Nov 25, 2022
@mtojek mtojek changed the title from "Add support for shutdown_script" to "feat: Add support for shutdown_script" Nov 25, 2022
@mtojek mtojek requested review from a team and mafredri November 25, 2022 19:06
@mtojek mtojek marked this pull request as ready for review November 25, 2022 19:12
@mtojek mtojek requested a review from a team as a code owner November 25, 2022 19:12
@mtojek mtojek requested review from presleyp and removed request for a team November 25, 2022 19:12
Member

@mafredri mafredri left a comment


Code looks good and seems like it should work, but there are a few things we still need to figure out.

  1. There is no guarantee that the shutdown script actually runs (this may be Terraform provider dependent)
  2. The test for shutdown works, but in practice it will not run (see below)

Regarding 1), we have no control over the way a Terraform provider will tear down its resources. Seeing this from the point of view of the agent, it could be receiving kill -[SIG] [pid] where the signal could be any of HUP, INT, TERM, KILL, etc. It'll be impossible to recover from a KILL, for instance, so the shutdown script can't run. Maybe there's also a chance that the agent will receive no signal at all; it'll just disappear.

To give a guarantee, we could initiate shutdown by communicating between server and agent that it's time for the agent to shut down. The agent would gracefully shut down all the services it needs to and message back to the server that everything went a-OK (or failed).

a-OK => provision down
error => surface the error, halt shutdown

That's one way I envision this could play out. There may be knobs for other scenarios, if needed. A force shutdown would not require the a-OK.

Regarding 2), I noticed we don't have proper signal handling for the agent. So even if we send kill -INT [pid], nothing happens and the shutdown script won't run. A kill -TERM [pid] will immediately kill the process, also not triggering shutdown.
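
For illustration, here is a minimal sketch of what trapping INT/TERM could look like, so that Close (and with it the shutdown script) gets a chance to run. This is not the actual agent code; agentCloser is a stand-in for the real agent:

package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
)

// agentCloser stands in for the real agent; Close is where the shutdown
// script would run, just before networking is torn down.
type agentCloser struct{}

func (agentCloser) Close() error {
	log.Println("running shutdown script...")
	return nil
}

func main() {
	// Cancel the context on SIGINT or SIGTERM instead of letting the runtime
	// kill the process immediately, so the agent can be closed gracefully.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	agent := agentCloser{}

	<-ctx.Done() // wait for a signal
	if err := agent.Close(); err != nil {
		log.Printf("close agent: %v", err)
	}
}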

I've been using the following terraform template to test it out:

`main.tf`:
terraform {
  required_providers {
    coder = {
      source  = "coder/coder"
      version = "0.6.5"
    }
  }
}

data "coder_provisioner" "me" {
}

data "coder_workspace" "me" {
}

resource "coder_agent" "main" {
  arch               = data.coder_provisioner.me.arch
  os                 = "linux"
  connection_timeout = 15
  motd_file          = "/etc/motd"
  startup_script     = <<-EOS
    echo hi
  EOS
  shutdown_script    = <<-EOS
    set -e
    echo starting shutdown...
    sleep 10
    echo bye
  EOS
}

resource "null_resource" "workspace" {
  count = data.coder_workspace.me.start_count

  provisioner "local-exec" {
    when        = create
    command     = <<-EOS
      set -e
      dir=/tmp/coder-agent-${self.id}
      mkdir -p "$dir"
      cd ~/src/coder/coder
      go build -o "$dir"/coder ./enterprise/cmd/coder

      cd "$dir"
      export CODER_AGENT_TOKEN=${coder_agent.main.token}
      export CODER_AGENT_AUTH=token
      export CODER_AGENT_URL=http://localhost:3000
      export CODER_AGENT_ID=${self.id}

      nohup ./coder agent >/dev/null 2>&1 &
      echo $! >"$dir"/coder-agent.pid
    EOS
    interpreter = ["/bin/bash", "-c"]
  }
  provisioner "local-exec" {
    when        = destroy
    command     = <<-EOS
      set -e
      pid=/tmp/coder-agent-${self.id}/coder-agent.pid
      if pgrep -F "$pid" >/dev/null 2>&1; then
        pkill -F "$pid"
        while pgrep -F "$pid" >/dev/null 2>&1; do
          sleep 1
        done
      fi
    EOS
    interpreter = ["/bin/bash", "-c"]
  }
}

metadata, valid := rawMetadata.(codersdk.WorkspaceAgentMetadata)
if !valid {
	return xerrors.Errorf("metadata is the wrong type: %T", metadata)
}
Member


We probably shouldn't do an early exit here and leave the close half-way done after it's already started.

err := a.runShutdownScript(ctx, metadata.ShutdownScript)
if err != nil {
	a.logger.Error(ctx, "shutdown script failed", slog.Error(err))
}
Member


We might want to do this earlier. And also, we might not want to continue on error. Critical workspace tasks could be performed at this step, like syncing filesystem to cold storage or something similar.
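
As a sketch of the "halt on error" alternative (reusing the names from the hunk above; this is not a change made in this PR):

err := a.runShutdownScript(ctx, metadata.ShutdownScript)
if err != nil {
	a.logger.Error(ctx, "shutdown script failed", slog.Error(err))
	// Propagate the error so the close halts here instead of continuing.
	return xerrors.Errorf("run shutdown script: %w", err)
}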

Member Author


And also, we might not want to continue on error. Critical workspace tasks could be performed at this step, like syncing filesystem to cold storage or something similar.

I assumed the opposite. I wouldn't like to end up with 100 workspaces I can't kill. I suppose it depends on which option is the lesser evil. I'm fine with changing the approach to align with your suggestion.

Member


It’s a valid thought. Another way to look at it is: if it’s allowed to fail, why have it at all? Both are of course valid. In the PR review comment I mentioned that a force stop would allow shutting down such workspaces. That’s one option; alternatively, it could be a provider option, but we should be careful about adding those until there are real use-cases.

Member Author


In the PR review comment I mentioned a force stop would allow shutting down such workspaces

I might have missed this comment, but do you mean having an option to configure the forced stop somewhere in metadata, which could be changed via the UI or somewhere else?

Member

@mafredri mafredri Nov 28, 2022


I was hinting that coder stop would receive a new flag, e.g. --force. Similarly, the WebUI might receive a new button. This would (probably) be CLI/server-only and not wait for the a-OK from the agent; instead, it would perform the terraform teardown immediately.

@@ -336,6 +336,7 @@ practices:
- Manually connect to the resource and check the agent logs (e.g., `docker exec` or AWS console)
- The Coder agent logs are typically stored in `/var/log/coder-agent.log`
- The Coder agent startup script logs are typically stored in `/var/log/coder-startup-script.log`
- The Coder agent shutdown script logs are typically stored in `/var/log/coder-shutdown-script.log`
Member


This (and the previous) documentation seems wrong. We currently always store these in /tmp (or rather, the OS temp dir), even if that isn't ideal.

@mtojek
Member Author

mtojek commented Nov 28, 2022

Regarding 2), I noticed we don't have proper signal handling for the agent. So even if we send kill -INT [pid], nothing happens and the shutdown script won't run. A kill -TERM [pid] will immediately kill the process, also not triggering shutdown.

Thank you for providing the .tf file, @mafredri. I will double-check the signal handler. I expect it to execute agent.Close() consistently, so in this case the shutdown script is not the only thing affected (networking, for instance). It looks like a bug to me; I'm not sure yet if it's easy to fix.

Maybe there's also a chance that the agent will receive no signal at all; it'll just disappear.

I'd like to see some numbers or bug reports, so that we know if it's a common issue.

To give a guarantee, we could initiate shutdown by communicating between server and agent that it's time for the agent to shut down. The agent would gracefully shut down all the services it needs to and message back to the server that everything went a-OK (or failed).

I admit that I would rather correct the signal handler than extend the communication flow. In this case, it would be synchronized and would require interaction with the agent coordinator (?). Also, if we decide to go this way, I think the startup script should be handled the same way.

@mafredri
Member

I'd like to see some numbers or bug reports, so that we know if it's a common issue.

The best approach is to look at what the common Terraform providers do (Docker behavior detailed below).

I admit that I would rather correct the signal handler than extend the communication flow. In this case, it would be synchronized and would require interaction with the agent coordinator (?). Also, if we decide to go this way, I think the startup script should be handled the same way.

I agree with correcting the signal handler, but it may not be enough. The Docker provider, for instance, does send SIGTERM, which the agent can handle gracefully. However, Docker also has a grace period (usually 10s), after which SIGKILL will be sent. This will interrupt the shutdown script.
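
For illustration, the grace period can be lengthened per container with something like the snippet below. This is a sketch only: it assumes the kreuzwerker/docker provider's docker_container resource and its stop_timeout argument (in seconds), and the resource name and image are illustrative:

resource "docker_container" "workspace" {
  count = data.coder_workspace.me.start_count
  name  = "coder-${data.coder_workspace.me.name}"
  image = "ubuntu:22.04"

  # Lengthen Docker's grace period (default ~10s) so the agent has time to
  # handle SIGTERM and finish the shutdown script before SIGKILL arrives.
  stop_timeout = 60
}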

But let's say we document this, and instruct users to set stop_timeout, etc (assuming it helps). There are still hundreds of providers out there that behave in their own way.

There's also the use case of showing startup script logs in build logs for the WebUI #2957. If we want to ensure shutdown logs are also sent to the server, we need better handling.

For instance, currently when we do coder stop workspace, the agent will almost immediately be marked as outdated:

GET http://localhost:3000/api/v2/workspaceagents/me/coordinate: unexpected status code 403: Agent trying to connect from non-latest build.

This would prevent us from sending the shutdown logs as well, which suggests we need some form of intermediary state between stop and running terraform.

@mtojek
Member Author

mtojek commented Nov 28, 2022

But let's say we document this, and instruct users to set stop_timeout, etc (assuming it helps). There are still hundreds of providers out there that behave in their own way.

This would prevent us from sending the shutdown logs as well, which suggests we need some form of intermediary state between stop and running terraform.

Considering these requirements, I agree that we need a custom mechanism to control the shutdown procedure, no matter which provider is in play. I will evaluate your suggestion about messaging the a-OK/error state back to the server. If it isn't super complex to implement, we can include it here.

Thanks for the feedback!

@mtojek mtojek marked this pull request as draft November 28, 2022 12:21
Member

@Kira-Pilot Kira-Pilot left a comment


FE ✅

Contributor

@presleyp presleyp left a comment


Frontend ✅

@github-actions

github-actions bot commented Dec 8, 2022

This Pull Request is becoming stale. In order to minimize WIP, prevent merge conflicts, and keep the tracker readable, I'm going to close this PR in 3 days if there isn't more activity.

@github-actions github-actions bot added the stale This issue is like stale bread. label Dec 8, 2022
@github-actions github-actions bot closed this Dec 11, 2022
@zifeo

zifeo commented Jan 12, 2023

@Kira-Pilot @presleyp Why is this approved but not merged? The pull request on the Terraform side has been merged, and I understand this cannot work without the corresponding implementation?

@presleyp
Contributor

@zifeo we were just approving the frontend changes, and then waiting for backend approval. In the meantime, the PR got stale and was closed automatically. @mtojek, what did you decide to do about this?

@mtojek
Member Author

mtojek commented Jan 12, 2023

There are some doubts about the stability of this solution. We discussed alternative options here. There is no decision yet on which approach we'd like to follow.

@zifeo

zifeo commented Jan 12, 2023

@presleyp @mtojek OK, the proposal seems great. In the meantime, maybe the one on the provider side should be reverted?

@mafredri
Member

mafredri commented Mar 6, 2023

@zifeo A PR that supersedes this one (#6139) was just merged, in case you want to check it out.

@zifeo

zifeo commented Mar 6, 2023

@mafredri Awesome, thanks! I will test when the next version is released 👍

Labels
stale This issue is like stale bread.
5 participants