-
Notifications
You must be signed in to change notification settings - Fork 887
feat: add support for coder_script
#9584
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@mafredri I'm going to just make the FE work as it did before, and then I believe this could be ready to merge. It's obviously massive, but I think there's a very low chance of regression here. I manually tested with an old version of the provider as well, and everything just works. |
c981da9
to
9a38131
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big fan of the changes in this PR, nice work!
Identified a few issues, building and tests seem broken, and I do think we'll introduce two breaking changes:
- The startup blocking behavior for old clients will break
- Old agents won't be able to send logs or may have issues communicating with coderd (didn't verify, just a hunch)
agent/agentscripts/agentscripts.go
Outdated
if script.Timeout > 0 { | ||
var cancel context.CancelFunc | ||
// Add a buffer to forcefully kill with the context. | ||
ctx, cancel = context.WithTimeout(ctx, script.Timeout+(3*time.Second)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may require a rethink. We're passing the agent context in New
and using it here (r.ctx
). This means that when the agent is interrupted, all scripts will be terminated, and shutdown scripts won't be able to run.
Currently in (*agent).Close
we use a background context as base, not the agent context. So I don't think it's a good idea tying the scripts context and agent context together. Instead we should rely on New
and Close
to perform the appropriate startup/teardown. In (agent).Close
, the Execute (stop) should run for as long as it needs to, or until the script timeout is hit (which will interrupt + wait X seconds + kill).
PS. I think this could be bumped to like 5 or 10 seconds at least.
wait = false | ||
default: | ||
return xerrors.Errorf("unknown startup script behavior %q", workspaceAgent.StartupScriptBehavior) | ||
for _, script := range workspaceAgent.Scripts { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing this change, I believe old clients connecting to coderd will be broken by this since the API no longer includes the startup script behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll fix this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mafredri do you think I should just include the field and leave it as a static value to not break old clients?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kylecarbs I'm assuming static would be non-blocking? For anyone using blocking
I suppose it would still be a breaking change. But I'm not sure it's worth including so much legacy to keep piping it to the DB, so I think it's acceptable. But if we leave it like that, I think we should mention it in the release notes.
This could be a separate PR, but thoughts on hiding these by default from the resources list on the dashboard? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments on the code, but other than that I also noticed one other issue:
I have two startup scripts, one blocks login and both have timed out. The workspace has entered start_timeout
lifecycle but the CLI still thinks the scripts are running.
❯ CODER_SSH_WAIT=no ssh coder.test
👋 Your workspace is outdated! Update it here: http://127.0.0.1:3000/@admin/test
==> ⧗ Running workspace agent startup scripts (non-blocking)
2023-09-25 18:19:23.047Z Startup Script 2: Hello, world 2!
2023-09-25 18:19:23.047Z Startup Script: Hello, world!
Notice: The startup scripts are still running and your workspace may be incomplete.
And this shows that the lifecycle has indeed changed (also visible in logs, but didn't attach those).
❯ ./scripts/coder-dev.sh list -o json | grep lifecycle
"lifecycle_state": "start_timeout",
Both the CLI from this PR and current main
behave the same way. I took a quick look but didn't see why this would be.
Finally, one more wish:
- It would be nice to differentiate scripts in the UI when no icon has been configured, currently a hover is required but that's not practical for 1000s of rows. Perhaps we could have like an array of 10 colors that we give the scripts based on their order/name. I say 10 colors since it seems risky to pick it randomly (if it's red it could feel like an error)
|
||
func cmdCancel(cmd *exec.Cmd) func() error { | ||
return func() error { | ||
return syscall.Kill(-cmd.Process.Pid, syscall.SIGINT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When e.g. bash
is executing a script, SIGINT
gets captured and passed to the current command being executed, this results in that command exiting and then the script executes the next command. Essentially circumventing what we're trying to do here.
Changing this to SIGHUP
will work better for what we're trying to do and will ensure everything is canceled. This is what a terminal typically sends to a program when the attached terminal goes away, so seems alright.
The downside is that perhaps it's more common to have handling for SIGINT
, in which case this could cause a dirty exit for some script authors. But we can just document this behavior instead so it's a non-issue and can still be handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I appreciate the thorough thought here! 😍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pre-approving, this really turned out nicely btw, awesome job!
@mafredri I did the deterministic colors - that was a great suggestion! |
This allows users to define separated log streams for different scripts, and customize the names + icons.
The diff here is massive, but 90% of this is boilerplate for just piping through such a significant amount of new data to the workspace agent. The only real refactor is how the workspace agent handles scripts.
This breaks any customers that are manually piping logs with our external API as it is right now, so I'll be fixing that prior to a merge. I think I'll use a hard-coded UUID for an external source, and call it "External" for now... then eventually our external tools can catch up.