
Add Kubelet stress test for pod cleanup when rejection due to VolumeAttachmentLimitExceeded #133357


Open
wants to merge 1 commit into
base: master


torredil
Member

@torredil torredil commented Aug 1, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a stress test that validates pod cleanup when SyncPod hits the VolumeAttachmentLimitExceeded error path in WaitForAttachAndMount. The test asserts three things. First, all 500 pods reach Phase=Failed with reason VolumeAttachmentLimitExceeded, confirming that admission fails as intended. Second, SyncTerminatedPod completes successfully for each pod, which shows that kubelet executes the full teardown flow for every rejected pod (volumes unmounted, cgroups destroyed, etc.). Third, the resources allocated to each pod, as managed by allocationManager, are cleaned up.
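To make the first assertion concrete, here is a minimal, self-contained sketch of the kind of check described above. It is not the code in this PR: `podStatus` and `verifyRejected` are hypothetical names, and the struct is a simplified stand-in for the real pod status type.

```go
package main

import "fmt"

// podStatus is a hypothetical, simplified stand-in for the status the
// kubelet reports for a pod. It is NOT the real v1.PodStatus type.
type podStatus struct {
	Name   string
	Phase  string
	Reason string
}

const (
	phaseFailed    = "Failed"
	reasonExceeded = "VolumeAttachmentLimitExceeded"
)

// verifyRejected returns nil only if exactly `want` statuses were collected
// and every one is Phase=Failed with the expected rejection reason.
func verifyRejected(statuses []podStatus, want int) error {
	if len(statuses) != want {
		return fmt.Errorf("expected %d pod statuses, got %d", want, len(statuses))
	}
	for _, s := range statuses {
		if s.Phase != phaseFailed || s.Reason != reasonExceeded {
			return fmt.Errorf("pod %s: phase=%q reason=%q", s.Name, s.Phase, s.Reason)
		}
	}
	return nil
}

func main() {
	// Build 500 synthetic statuses, mirroring the pod count named above.
	statuses := make([]podStatus, 0, 500)
	for i := 0; i < 500; i++ {
		statuses = append(statuses, podStatus{
			Name:   fmt.Sprintf("pod-%d", i),
			Phase:  phaseFailed,
			Reason: reasonExceeded,
		})
	}
	if err := verifyRejected(statuses, 500); err != nil {
		fmt.Println("check failed:", err)
		return
	}
	fmt.Println("all pods rejected as expected")
}
```

Checking the count as well as each status guards against a pod silently going missing, which a per-pod loop alone would not catch.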

Which issue(s) this PR is related to:

Fixes #133188

Special notes for your reviewer: See #132933 (comment) for context.

Does this PR introduce a user-facing change?

NONE.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

N/A

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 1, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Aug 1, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: torredil
Once this PR has been reviewed and has the lgtm label, please assign random-liu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@torredil
Member Author

torredil commented Aug 1, 2025

/sig node
/sig storage

@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. labels Aug 1, 2025
@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Aug 1, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Aug 1, 2025
…LimitExceeded

Signed-off-by: Eddie Torres <torredil@amazon.com>
@SergeyKanzhelev
Member

/area test

This test is useful. I would also urge adding an e2e test, since e2e may catch issues that are hard to repro with unit tests. I am not 100% sure we will need the unit-level stress test if we can do e2e stress instead.

@torredil
Member Author

torredil commented Aug 6, 2025

@SergeyKanzhelev In the first iteration of this PR I was actually writing this stress test as an e2e test similar to

f.It("should transition pod to failed state when attachment limit exceeded", func(ctx context.Context) {
but scaled up. Then I realized we wouldn't have visibility into kubelet internals (podWorkers, statusManager, allocationManager) as we do in this test (to validate that SyncTerminatedPod completes successfully, that allocated resources are cleaned up, etc.).

Important to note that the test intentionally uses a real instance of podWorkers. I think the code coverage gained from this test is valuable and worth merging. I'm happy to explore the e2e path once more if you recommend it 👍

@SergeyKanzhelev
Member

> then I realized we wouldn't have visibility into kubelet internals (podWorkers, statusManager, allocationManager) as we do in this test (to validate SyncTerminatedPod completes successfully, allocated resources are cleaned up, etc).

What e2e will validate is that pod admission keeps failing at the same stage, and not on some new condition like IP exhaustion.

@SergeyKanzhelev
Member

> I think the code coverage gained from this test is valuable and worth merging.

We will triage next week in SIG Node CI meeting and find a reviewer, thanks!

Labels
area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Add Kubelet stress test for rejecting pods when the attachment limit is exceeded
3 participants