Skip to content

feat: implement agent process management #9461

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
Sep 15, 2023
Merged

feat: implement agent process management #9461

merged 21 commits into from
Sep 15, 2023

Conversation

sreya
Copy link
Collaborator

@sreya sreya commented Aug 31, 2023

  • An opt-in feature has been added to the agent to allow
    deprioritizing non coder-related processes for CPU by setting their
    niceness level to 10.
  • Opting in to the feature requires setting CODER_PROC_MEMNICE_ENABLE to a non-empty value.

fixes #8517

- An opt-in feature has been added to the agent to allow
  deprioritizing non coder-related processes for both CPU
  and memory. Non coder processes have their niceness set to 10
  and their oom_score_adj set to 100
@sreya sreya requested a review from ammario September 8, 2023 22:49
@ammario
Copy link
Member

ammario commented Sep 9, 2023

Why is it opt-in? Seems like a healthy default. Also, could remove a configuration knob.

There's also implicit configuration in that the system can enable / disable priority management via the capabilities it gives the agent.

@ammario
Copy link
Member

ammario commented Sep 9, 2023

Also: aiming to have a review for this by tomorrow afternoon.

Copy link
Member

@ammario ammario left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused about why we're processing all processes vs. just the agent PID itself? In theory niceness and oom_score is supposed to be relative—processes that want higher priority should be able to get it without knowledge of every other process.

The current approach whereby we're iterating all processes has scalability issues too. For example, if the agent runs in a large process namespace where it only controls a small number of processes, it would generate immense log spam.

agent/agent.go Outdated
name := filepath.Base(proc.Name())
// If the process is prioritized we should adjust
// it's oom_score_adj and avoid lowering its niceness.
if slices.Contains(prioritizedProcs, name) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We want to specifically prioritize the agent and not other coder processes right? If I'm reading this code correctly it would treat coder server and coder stat the same as the agent.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good catch, I don't see that as being a big deal but we can be more discriminate about which processes we want to prioritize by also parsing command arguments. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just check if its the current process?

@sreya
Copy link
Collaborator Author

sreya commented Sep 11, 2023

I'm confused about why we're processing all processes vs. just the agent PID itself? In theory niceness and oom_score is supposed to be relative—processes that want higher priority should be able to get it without knowledge of every other process.

The current approach whereby we're iterating all processes has scalability issues too. For example, if the agent runs in a large process namespace where it only controls a small number of processes, it would generate immense log spam.

This implementation was adopted from v1 since it is re-implementing functionality requested from a v1 customer. I imagine it was implemented this way because decreasing the niceness score of a process requires CAP_SYS_NICE, whereas increasing it can be done without any additional capabilities. Coder admins are not always sysadmins meaning they may not have the ability to provide additional capabilities.

Why is it opt-in? Seems like a healthy default. Also, could remove a configuration knob.

I figured introducing this sort of functionality is safer to do opt-in, especially because interested parties only need to add a single env var to their template to enable it. If there's any unforeseen issues with the feature we avoid a serious regression by enabling it by default for every customer as opposed to the few who are specifically requesting this feature. Eventually I'd like to promote it to enabled by default once we're confident there aren't any surprises.

@ammario
Copy link
Member

ammario commented Sep 12, 2023

I'm confused about why we're processing all processes vs. just the agent PID itself? In theory niceness and oom_score is supposed to be relative—processes that want higher priority should be able to get it without knowledge of every other process.

The current approach whereby we're iterating all processes has scalability issues too. For example, if the agent runs in a large process namespace where it only controls a small number of processes, it would generate immense log spam.

This implementation was adopted from v1 since it is re-implementing functionality requested from a v1 customer. I imagine it was implemented this way because decreasing the niceness score of a process requires CAP_SYS_NICE, whereas increasing it can be done without any additional capabilities. Coder admins are not always sysadmins meaning they may not have the ability to provide additional capabilities.

Why is it opt-in? Seems like a healthy default. Also, could remove a configuration knob.

I figured introducing this sort of functionality is safer to do opt-in, especially because interested parties only need to add a single env var to their template to enable it. If there's any unforeseen issues with the feature we avoid a serious regression by enabling it by default for every customer as opposed to the few who are specifically requesting this feature. Eventually I'd like to promote it to enabled by default once we're confident there aren't any surprises.

SGTM

@sreya sreya marked this pull request as ready for review September 12, 2023 22:39
agent/agent.go Outdated
name := filepath.Base(proc.Name())
// If the process is prioritized we should adjust
// it's oom_score_adj and avoid lowering its niceness.
if slices.Contains(prioritizedProcs, name) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just check if its the current process?

Copy link
Member

@deansheather deansheather left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, but Ammar's comment about only prioritizing based on PID should be added before merge

@sreya
Copy link
Collaborator Author

sreya commented Sep 14, 2023

why not just check if its the current process?

In the future we may want to be able to prioritize other processes such as sshd which is why it's implemented the way it is, but we can just do it pid-based for now.

@sreya sreya merged commit 7311ffb into main Sep 15, 2023
@sreya sreya deleted the jon/agentproc branch September 15, 2023 00:45
@github-actions github-actions bot locked and limited conversation to collaborators Sep 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prevent agents from being killed in CPU or memory-constrained workspaces
3 participants