Skip to content

feat: allow execution ordering for coder_script #10352

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Tracked by #10664
stirby opened this issue Oct 19, 2023 · 18 comments
Open
Tracked by #10664

feat: allow execution ordering for coder_script #10352

stirby opened this issue Oct 19, 2023 · 18 comments
Labels
needs-rfc Issues that needs an RFC due to an expansive scope and unclear implementation path.

Comments

@stirby
Copy link
Collaborator

stirby commented Oct 19, 2023

Overview

Currently, when building a workspace with multiple instances of coder_script, all are run concurrently, preventing any sequencing logic. To remedy this, some serialization should be implemented allowing users to either define the sequence manually or through a dependency graph.

@spikecurtis
Copy link
Contributor

I'd actually call this a bug. We don't know a priori that scripts are safe to run concurrently. In particular, I use dotfiles to install software on startup, and this fails if other scripts are also doing dpkg stuff

Running install.sh...
E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 176 (apt-get)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), is another process using it?
sudo: add-apt-repository: command not found

@stirby stirby added s2 Broken use cases or features (with a workaround). Only humans may set this. bug and removed feature labels Oct 30, 2023
@matifali
Copy link
Member

matifali commented Oct 30, 2023

It could be done in multiple ways,

  1. Use a syntax like run_after = coder_script.dotfiles. If no run_after exists, we can default to running alphabetically.
resource "coder_script" "dotfiles" {
  agent_id = coder_agent.main.id
  ...
}

resource "coder_script" "personalize" {
  agent_id = coder_agent.main.id
  ...
  run_after = coder_script.dotfiles
}
  1. We use order the same as with coder_parameter, It may look something like the below,
resource "coder_script" "dotfiles" {
  agent_id = coder_agent.main.id
  ...
  order    = "1"
}

resource "coder_script" "personalize" {
  agent_id = coder_agent.main.id
  ...
  order    = "2"
}
  1. I am not sure if we can also use the built-in terraform depends_on as the scripts are run after the terraform apply has finished (in a running workspace by the coder_agent)
resource "coder_script" "dotfiles" {
  agent_id = coder_agent.main.id
  ...
}

resource "coder_script" "personalize" {
  agent_id   = coder_agent.main.id
  ...
  depends_on = ["coder_script.dotfiles"]
}

I am not very sure how we can expose this on module level and how the coder_agent will use this.

@kconley-sq
Copy link
Contributor

I think it'd be neat if depends_on could be used for this because otherwise I think the run_after or order approaches would require Coder (Terraform) modules providing coder_script's to expose those fields as module input variables for consumers of the module to set them, and I imagine that exposing may not be done consistently (at least for 3rd party modules).

@ammario
Copy link
Member

ammario commented Nov 13, 2023

I strongly prefer a depends_on or similar DAG-based approach as its consistent with Terraform's philosophy.

@matifali
Copy link
Member

This is a prerequisite for coder/registry#82

@bpmct
Copy link
Member

bpmct commented Nov 16, 2023

We had a chat with @kylecarbs on this yesterday. Here are some conclusions.

  • depends_on is very idiomatic but may be difficult for module 1 to wait on a script for module 2. We're not entirely sure so this should require a bit of investigation
  • Another idea is for a startup script to do something inline to wait on another script
    coder script wait code-server
    code-server --install-extension myextension
    • This was @bpmct and @kylecarbs preference because it also allows us to do stuff like coder script follow to see logs or even coder script restart if something fails. A pretty good new abstraction :)
  • We definitely want to avoid an order property

@spikecurtis
Copy link
Contributor

This is getting complex enough that we should have an RFC, IMO. But, since I'm thinking about it right now, I'll add some comments directly to the issue.


depends_on sort of makes sense if you squint, but usually it's for telling Terraform about dependencies such that terraform apply does things in the right order. Here, it's not terraform apply that needs to get stuff in the right order, it's the script execution in the agent.


One big downside of coder script wait is that the script itself needs to be aware of the exact thing it is waiting for. But, when scripts are encapsulated in modules, the module author might not know about a problem. For example, my dotfiles consistently fail to run install.sh because another script is holding the dpkg lock. In order to use coder script wait, firstly, it would mean my dotfiles only work on Coder, and secondly, they would probably only work on a specific Coder template where I can figure out what script is conflicting.

Another related issue is that a module author might know what it is dependent on, but not which module provides it. Some dependencies might not be provided by any modules and instead are baked in. For example, my module might require curl. Maybe curl is just available in my image, or maybe some other script or module installs it, as a module author I shouldn't have to know as this creates inflexible dependencies.

Therefore, I think we need to be able to define module dependencies in the template, so that the template author has flexibility to control dependencies, rather than forcing the module author to.


One last comment is that modules generally don't expose their scripts as outputs, so we need to handle all combinations of module and script dependencies. Then, in the terraform provisioner, we can resolve the DAG of scripts & modules into a DAG of just scripts.

  • script -> script
  • script -> module
    • the script gets an edge to every script in the target module
  • module -> script
    • every script in the module gets an edge to the target script
  • module -> module
    • every script in the module gets an edge to every script in the target module

@matifali
Copy link
Member

I agree with @spikecurtis reservation and on having an RFC on this.

My thoughts:

  1. Modules can or cannot contain coder_script resources depending on the nature of what the module does. We have modules that only add some coder_parameter data items to the workspace form.
  2. For Modules that use coder_script, they can have multiple coder_script resources, and in that case, we do need an order property that the module builder will use or if someone is directly using coder_script resources in a template.
  3. For all modules that use any coder_script resource, we must pass an agent_id in the form of coder_agent.NAME.id as input. So, for one module to depend on another, we need a way to let the agent running the module's coder_script resources wait for any dependency modules' ' coder_script` resources to finish executing. I am unsure how it can be done while staying close to what we have now.

@bpmct
Copy link
Member

bpmct commented Nov 18, 2023

Workaround

For those looking for this behavior today, you can use a bash while loop in your coder script to wait for a specific command (e.g. code-server to become available or conditionals to wait for a process to complete, start, or be unlocked)

@michaelbrewer
Copy link
Contributor

Workaround

For those looking for this behavior today, you can use a bash while loop in your coder script to wait for a specific command (e.g. code-server to become available or conditionals to wait for a process to complete, start, or be unlocked)

This kind of works, but there are many other cases, like git-clone script needs to complete before executing the next script from another module.

@MrPeacockNLB
Copy link
Contributor

MrPeacockNLB commented Apr 5, 2024

We are using more than 16 scripts at startup. Some of them has to wait until git cloned has finished. Other has to wait for other scripts or at least part of it. We are providing 2 simple bash scripts to realize the sendSignal and waitFor. This creates a file under /tmp and the other waits in a while loop until it is created. This way we can build our own DAG.

NOTE: Sending signals should be possible at any time or any place. Not just at the edges of entry and exit points!

This works pretty stable. But it would be nice if there was something similar as build in function without needing to store files in a temporary filesystem. For me I could think of storing signals at workspace level.

coder signal git-cloned # This sends the signal to coder server
coder waitfor git-cloned # wait until signal "git-clond" has been send to coder server

@michaelbrewer I can say this works as expected

@michaelbrewer
Copy link
Contributor

Oh thanks. I need to add this to my going list of notes.

@evilhamsterman
Copy link

evilhamsterman commented May 30, 2024

Workaround

For those looking for this behavior today, you can use a bash while loop in your coder script to wait for a specific command (e.g. code-server to become available or conditionals to wait for a process to complete, start, or be unlocked)

If you are running into dpkg locks you can use apt-get -o DPkg::Lock::Timeout=60 <normal install/update commands that will make apt wait 60 seconds, or -1 if you want it to wait forever, for the dpkg lock to clear.

@michaelbrewer
Copy link
Contributor

I would like to add support for coder signals for some of the common coder terraform modules like git-clone etc.

@matifali matifali removed the bug label Oct 14, 2024
@djarbz
Copy link

djarbz commented Oct 22, 2024

I like this.

coder signal git-cloned # This sends the signal to coder server
coder waitfor git-cloned # wait until signal "git-clond" has been send to coder server

But I would add coder waitlock <lock> <timeout>.
Then we could request a lock for any signal and re-lock it. This way multiple scripts could queue for a lock such as for apt-get.

I'd also like to add that this is inherently prevalent with the new KasmVNC module.

@mafredri
Copy link
Member

I noticed that most of the solutions here seem to focus on running something after another script has finished. This is probably the most common use case. But I think it's equally important to be able to define the inverse dependency, which means telling a coder module or script to wait for my script to finish.

Simple use-case: Modify apt mirrors before other scripts install tools for better performance.

The coder script wait/coder signal proposals can't easily handle this case because the dependencies are evaluated at script runtime.

I think it would be great if we could create an RFC on this.

@matifali matifali added needs-rfc Issues that needs an RFC due to an expansive scope and unclear implementation path. and removed s2 Broken use cases or features (with a workaround). Only humans may set this. labels Oct 31, 2024
@spikecurtis
Copy link
Contributor

My high level view is that the Coder agent is essentially playing the role of an init system, especially in containerized deployments (VMs can use the init system that comes with the OS in many cases).

As such, we should pay attention to prior art in designing what we are going to do. Let's try to avoid redesigning the wheel, or designing it piecemeal without the long term view of what it is.

@mafredri
Copy link
Member

mafredri commented Nov 5, 2024

That's exactly right, @spikecurtis and so far we've implemented a very basic init for start, shutdown and cron. Even running services is currently just a hack on top of "start" and we don't support service restart or stop either.

I agree it would be great if we didn't have to redesign the wheel, but we may have to recreate it.

As a hypothetical, let's say we make systemd work inside the workspace. Then the agent no longer needs "scripts" because we can give it a list of files (and their contents) to write onto disk. These files will configure systemd which is then started by the agent. Each coder module can then append to this list of files to write its own service definitions (long-running, one-shot, cron, etc). This addresses the issue raise here as it just so happens systemd has a solution to manage dependencies (by setting e.g. Before=coder-module-code-server.service, After=coder-module-dotfiles.service).

Portability is a big concern though, and a reason we can't use systemd, s6, OpenRC or any other popular init system written in C. We don't need all the bells and whistles, but unless we consider the broader lifecycle of a workspace, how they're configured, and how things are run inside them, we will end up with partial solutions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-rfc Issues that needs an RFC due to an expansive scope and unclear implementation path.
Projects
None yet
Development

No branches or pull requests