Skip to content

comment out the device plugin test case: supports extended resources … #133286

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

yliaog
Copy link
Contributor

@yliaog yliaog commented Jul 29, 2025

…together with ResourceClaim

What type of PR is this?

/kind bug

What this PR does / why we need it:

comment out the device plugin test case: supports extended resources together with ResourceClaim, it conflicts with the other test case: "must run pods with extended resource on dra nodes and device plugin nodes"

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 29, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 29, 2025
@k8s-ci-robot k8s-ci-robot requested review from klueska and pohly July 29, 2025 20:11
@k8s-ci-robot k8s-ci-robot added area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jul 29, 2025
@k8s-ci-robot k8s-ci-robot added wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 29, 2025
@yliaog
Copy link
Contributor Author

yliaog commented Jul 29, 2025

/assign @pohly

@pohly
Copy link
Contributor

pohly commented Jul 29, 2025

In the PR description you claim that it conflicts, but that claim is not supported by an actual explanation why or how.

Let's investigate and find the root cause before making further changes.

/cc @johnbelamaric

@yliaog
Copy link
Contributor Author

yliaog commented Jul 29, 2025

the comment in the code provides more details on why there is such conflict

@yliaog
Copy link
Contributor Author

yliaog commented Jul 29, 2025

18 minutes after the test finishes, kubelet still publishes 'example.com/resource: "0"' in node.status.Capacity and node.status.Allocatable. I tried to remove it from node.status, but kubelet keeps putting it back. still need to figure out how to cleanly remove it.

while true; do date; kubectl get no -o yaml|grep example; sleep 5; done

Tue Jul 29 08:24:39 PM UTC 2025
example.com/resource: "0"
example.com/resource: "2"
... ... ...
Tue Jul 29 08:42:17 PM UTC 2025
example.com/resource: "0"
example.com/resource: "0"

@yliaog
Copy link
Contributor Author

yliaog commented Jul 29, 2025

ok, it failed again, now i need to keep looking into it.

@yliaog
Copy link
Contributor Author

yliaog commented Jul 29, 2025

/test pull-kubernetes-kind-dra-all

@yliaog
Copy link
Contributor Author

yliaog commented Jul 29, 2025

@pohly @johnbelamaric found out the root cause.

the feature "DRADeviceBindingConditions: true" does not work well with "DRAExtendedResources: true", i think DRADeviceBindingConditions was merged last night? I did not enable it when running the test locally.

360 2025-07-29T23:19:13.141476715Z stderr F panic: runtime error: invalid memory address or nil pointer dereference
361 2025-07-29T23:19:13.141506335Z stderr F [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x193c851]
k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources.hasBindingConditions-range1(...)
365 2025-07-29T23:19:13.141546435Z stderr F k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources/dynamicresources.go:1641
366 2025-07-29T23:19:13.141554475Z stderr F k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources.hasBindingConditions.(*claimStore).all.func1(...) 367 2025-07-29T23:19:13.141564265Z stderr F k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources/claims.go:72 368 2025-07-29T23:19:13.141584485Z stderr F k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources.hasBindingConditions(0xc00051e5a0?) 369 2025-07-29T23:19:13.141592225Z stderr F k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources/dynamicresources.go:1640 +0x51 370 2025-07-29T23:19:13.141610715Z stderr F k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources.(*DynamicResources).PreBind(0xc00051e5a0, {0x232c588, 0xc0007ef740}, {0x233b410?, 0xc00051f810?}, 0xc000d63688, {0xc000467d30, 0xc}) 371 2025-07-29T23:19:13.141632805Z stderr F k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources/dynamicresources.go:1312 +0x3d8

@pohly
Copy link
Contributor

pohly commented Jul 30, 2025

So this E2E failure was legitimate and #133290 fixes the problem, and was merged via #132522.

There's still some learnings here: removing the extended resource from node status doesn't really work because of the kubelet. We should update the code accordingly and replace patching the node status with waiting for the count to go to zero or (in case of a future kubelet enhancement) to not be present.

Let's use this PR to implement that? Or file an issue and close the PR.

@yliaog
Copy link
Contributor Author

yliaog commented Jul 30, 2025

we can keep using this PR since it has some context already, what needs to figure out is why kubelet keeping publishing the extended resource after device plugin has been removed.

@SergeyKanzhelev SergeyKanzhelev moved this from Triage to PRs Waiting on Author in SIG Node CI/Test Board Jul 30, 2025
@bart0sh bart0sh moved this from Triage to Waiting on Author in SIG Node: code and documentation PRs Jul 31, 2025
@yliaog
Copy link
Contributor Author

yliaog commented Aug 12, 2025

#133488

@BenTheElder
Copy link
Member

Should we fix the leak/conflict instead?

These tests could also be marked as serial.

@yliaog
Copy link
Contributor Author

yliaog commented Aug 12, 2025

@BenTheElder we should fix the leak, but before that, it is better to comment out the other conflicting test, which was added for testing the deployDevicePlugin, which has already been tested in "must run pods with extended resource on dra nodes and device plugin nodes" test case.

they are already marked as Serial, but that would not help, since the leaked example.com/resouce in the cluster's nodes still interferes with the next test run in the same cluster.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: yliaog
Once this PR has been reviewed and has the lgtm label, please ask for approval from pohly. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yliaog
Copy link
Contributor Author

yliaog commented Aug 12, 2025

/test pull-kubernetes-unit

@yliaog
Copy link
Contributor Author

yliaog commented Aug 12, 2025

thet two test failures are not related to this PR, they can be ignored.

@BenTheElder
Copy link
Member

they are already marked as Serial, but that would not help, since the leaked example.com/resouce in the cluster's nodes still interferes with the next test run in the same cluster.

Ah ... that's pretty bad, would like to get that fixed.

We usually don't comment out tests, we e.g. [Flaky] or [Serial] or [Disruptive] tag them or delete them outright as appropriate. If deleted they will still be in git. If meant to be only temporarily disabled, Flaky etc works.

framework.WithFlaky() might be an appropriate short-term solution?

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 13, 2025
@yliaog
Copy link
Contributor Author

yliaog commented Aug 13, 2025

ok, added withFlaky() to temporarily disable the test

@k8s-ci-robot
Copy link
Contributor

@yliaog: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-dra-integration edfa9a5 link false /test pull-kubernetes-dra-integration
pull-kubernetes-kind-dra-n-1 edfa9a5 link false /test pull-kubernetes-kind-dra-n-1

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.
Projects
Status: 🆕 New
Status: PRs Waiting on Author
Status: Waiting on Author
Development

Successfully merging this pull request may close these issues.

4 participants