
[FG:InPlacePodVerticalScaling] Move pod resource allocation management out of the status manager #130254


Merged
merged 4 commits into kubernetes:master from allocation-manager-2 on Feb 28, 2025

Conversation

tallclair
Member

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Refactor pod resource allocation management out of the status manager, in preparation for expanding the scope of what the allocation manager needs to track. Pod allocations are used for more than setting pod status, so storing them in the status manager doesn't make sense.

Special notes for your reviewer:

I've tried to organize the commits to make this easier to review:

  1. Move allocation state out of statusmanager - This is just a straight move of the state package
  2. Extract pod allocation manager from status manager - The Get*, Set*, and Update* methods are moved directly from the status manager. Initialize the new allocation.Manager, and add the Delete* and RemoveOrphanedPods functions. Embed the allocation.Manager directly in status.Manager so that callers don't need to be updated (yet); a rough sketch of the resulting interface follows this list.
  3. Call allocationManager directly - Update callers, remove the embedded allocation.Manager from status.Manager.
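For orientation, here is a minimal sketch of what the extracted interface could look like after step 2. It is assembled from the method families named above and the snippets quoted later in this thread; the exact signatures, method set, and package path are assumptions rather than a copy of the PR.

package allocation

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/sets"
)

// Manager tracks the resources allocated to each pod, checkpointed under the
// kubelet's state directory (sketch only; signatures are illustrative).
type Manager interface {
	// GetContainerResourceAllocation returns the checkpointed resources for a container, if any.
	GetContainerResourceAllocation(podUID types.UID, containerName string) (v1.ResourceRequirements, bool)
	// SetPodAllocation checkpoints the allocated resources for a pod.
	SetPodAllocation(pod *v1.Pod) error
	// UpdatePodFromAllocation overwrites the pod spec with the allocated resources,
	// returning whether anything changed.
	UpdatePodFromAllocation(pod *v1.Pod) (*v1.Pod, bool)
	// DeletePodAllocation removes the checkpoint entry for a single pod.
	DeletePodAllocation(uid types.UID)
	// RemoveOrphanedPods drops entries for pods that are no longer present on the node.
	RemoveOrphanedPods(remainingPods sets.Set[types.UID])
}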

Does this PR introduce a user-facing change?

NONE

/sig node
/priority important-soon

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. sig/node Categorizes an issue or PR as relevant to SIG Node. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 19, 2025
@tallclair
Member Author

/assign @SergeyKanzhelev

@k8s-ci-robot k8s-ci-robot added area/kubelet approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 19, 2025
@tallclair tallclair force-pushed the allocation-manager-2 branch from ea3fad5 to 5bd8f29 Compare February 19, 2025 03:49
@ffromani
Contributor

/cc

stateImpl, err := state.NewStateCheckpoint(checkpointDirectory, podStatusManagerStateFile)
if err != nil {
// This is a critical, non-recoverable failure.
klog.ErrorS(err, "Could not initialize pod allocation checkpoint manager, please drain node and remove policy state file")
Contributor


I note that https://kubernetes.io/docs/reference/node/kubelet-files/ doesn't mention what a policy state file is.

Member Author

@tallclair tallclair Feb 21, 2025


The details are in the error, so most of this message is redundant. I cleaned it up in another commit.
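As a rough illustration of that cleanup (the follow-up commit isn't quoted here, so the exact wording is an assumption), the wrapped error can carry the details while the log call itself stays short:

// Hypothetical shortened log line; relies on err for the failure details.
klog.ErrorS(err, "Failed to initialize pod allocation checkpoint manager")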

@tallclair tallclair force-pushed the allocation-manager-2 branch from 4213a82 to 82ba31b Compare February 21, 2025 17:39
@tallclair tallclair force-pushed the allocation-manager-2 branch from 82ba31b to 9024140 Compare February 21, 2025 17:40
@tallclair
Member Author

/retest

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: felipeagger, tallclair

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tallclair tallclair changed the title Move pod resource allocation management out of the status manager [FG:InPlacePodVerticalScaling] Move pod resource allocation management out of the status manager Feb 24, 2025
@tallclair
Member Author

/test pull-kubernetes-integration
/test pull-kubernetes-e2e-gce-cos-alpha-features
/test pull-kubernetes-node-kubelet-podresize

@felipeagger
Member

/lgtm

@k8s-ci-robot
Contributor

@felipeagger: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@@ -1169,9 +1169,9 @@ func (kl *Kubelet) HandlePodCleanups(ctx context.Context) error {
 	// desired pods. Pods that must be restarted due to UID reuse, or leftover
 	// pods from previous runs, are not known to the pod worker.

-	allPodsByUID := make(map[types.UID]*v1.Pod)
+	allPodsByUID := make(sets.Set[types.UID])
Member Author


note to reviewers: this was previously unused, so I repurposed it.
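For context, a minimal sketch of the assumed usage after this change: the repurposed set collects the UIDs of all desired pods and is handed to the allocation manager so it can drop checkpoints for pods that no longer exist. The allPods slice, the allocationManager field, and the exact call site are assumptions based on the surrounding diff, not quoted from the PR.

allPodsByUID := make(sets.Set[types.UID])
for _, pod := range allPods {
	allPodsByUID.Insert(pod.UID)
}
// Assumed call: let the allocation manager remove checkpoints for pods
// that are no longer desired or running.
kl.allocationManager.RemoveOrphanedPods(allPodsByUID)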

@bart0sh bart0sh moved this from Triage to Needs Reviewer in SIG Node: code and documentation PRs Feb 27, 2025
func (m *manager) DeletePodAllocation(uid types.UID) {
if err := m.state.Delete(string(uid), ""); err != nil {
// If the deletion fails, it will be retried by RemoveOrphanedPods, so we can safely ignore the error.
klog.V(3).ErrorS(err, "Failed to delete pod allocation", "podUID", uid)
Member


Noting that this didn't happen even once in CI
https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/130254/pull-kubernetes-e2e-kind/1892992974711689216/artifacts/kind-worker/kubelet.log

gsutil cat gs://kubernetes-ci-logs/pr-logs/pull/130254/pull-kubernetes-e2e-kind/1892992974711689216/artifacts/kind-worker/kubelet.log | grep 'Failed to delete pod allocation'
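To make the "retried by RemoveOrphanedPods" comment in the snippet above concrete, here is a sketch of how that retry could work on the next cleanup pass; the allocations() accessor is hypothetical, and the PR's state interface may implement this differently.

func (m *manager) RemoveOrphanedPods(remainingPods sets.Set[types.UID]) {
	// allocations() is a hypothetical accessor over the checkpointed pod UIDs.
	for uid := range m.allocations() {
		if !remainingPods.Has(uid) {
			// Same deletion path as DeletePodAllocation above; a failure here
			// will simply be retried on the next cleanup pass.
			if err := m.state.Delete(string(uid), ""); err != nil {
				klog.ErrorS(err, "Failed to remove orphaned pod allocation", "podUID", uid)
			}
		}
	}
}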

@BenTheElder
Member

/lgtm
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 28, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 28, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: e30f706d4bf01f3ca427d268b1eda3264172663f

@BenTheElder
Member

/test pull-kubernetes-cmd

@k8s-ci-robot k8s-ci-robot merged commit 3560950 into kubernetes:master Feb 28, 2025
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Feb 28, 2025
@github-project-automation github-project-automation bot moved this from Needs Reviewer to Done in SIG Node: code and documentation PRs Feb 28, 2025
@esotsal
Contributor

esotsal commented Mar 2, 2025

/lgtm thanks @tallclair

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
area/kubelet
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.
lgtm "Looks good to me", indicates that a PR is ready to be merged.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
release-note-none Denotes a PR that doesn't merit a release note.
sig/node Categorizes an issue or PR as relevant to SIG Node.
size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
triage/accepted Indicates an issue or PR is ready to be actively worked on.
8 participants