DRA E2E: promote one test to conformance #133132

Open · wants to merge 1 commit into master from dra-e2e-conformance

Conversation

@pohly (Contributor) commented Jul 22, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

All of these tests already ran before for a while in
https://testgrid.k8s.io/sig-release-master-informing#kind-master-beta&include-filter-by-regex=DRA

Now that DRA is GA, they also ran e.g. in
https://testgrid.k8s.io/sig-release-master-blocking#kind-master&include-filter-by-regex=DRA
https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default&include-filter-by-regex=DRA

For the sake of starting conservatively, only a single basic test which exercises ResourceSlice, DeviceClass, ResourceClaimTemplate, ResourceClaim and ResourceClaim/Status gets promoted to a conformance test:

    dra/dra.go:747: [sig-node] [DRA] control plane supports claim and class parameters [Conformance]

Documentation for most of the others gets added for future promotion, but they remain labeled as ConformanceCandidate:

    dra/dra.go:1993: [sig-node] [DRA] ResourceSlice Controller creates slices [ConformanceCandidate]
    dra/dra.go:2285: [sig-node] [DRA] control plane must apply per-node permission checks [ConformanceCandidate]
    dra/dra.go:1013: [sig-node] [DRA] control plane must deallocate after use [ConformanceCandidate]
    dra/dra.go:825: [sig-node] [DRA] control plane retries pod scheduling after creating device class [ConformanceCandidate]
    dra/dra.go:851: [sig-node] [DRA] control plane retries pod scheduling after updating device class [ConformanceCandidate]
    dra/dra.go:889: [sig-node] [DRA] control plane runs a pod without a generated resource claim [ConformanceCandidate]
    dra/dra.go:2149: [sig-node] [DRA] control plane supports count/resourceclaims.resource.k8s.io ResourceQuota [ConformanceCandidate]
    dra/dra.go:977: [sig-node] [DRA] control plane supports external claim referenced by multiple containers of multiple pods [ConformanceCandidate]
    dra/dra.go:957: [sig-node] [DRA] control plane supports external claim referenced by multiple pods [ConformanceCandidate]
    dra/dra.go:997: [sig-node] [DRA] control plane supports init containers [ConformanceCandidate]
    dra/dra.go:929: [sig-node] [DRA] control plane supports inline claim referenced by multiple containers [ConformanceCandidate]
    dra/dra.go:759: [sig-node] [DRA] control plane supports reusing resources [ConformanceCandidate]
    dra/dra.go:793: [sig-node] [DRA] control plane supports sharing a claim concurrently [ConformanceCandidate]
    dra/dra.go:942: [sig-node] [DRA] control plane supports simple pod referencing external resource claim [ConformanceCandidate]
    dra/dra.go:915: [sig-node] [DRA] control plane supports simple pod referencing inline resource claim [ConformanceCandidate]
    dra/dra.go:2137: [sig-node] [DRA] control plane truncates the name of a generated resource claim [ConformanceCandidate]
    dra/dra.go:1084: [sig-node] [DRA] control plane with different ResourceSlices keeps pod pending because of CEL runtime errors [ConformanceCandidate]
    dra/dra.go:1149: [sig-node] [DRA] control plane with node-local resources uses all resources [ConformanceCandidate]

Which issue(s) this PR is related to:

Fixes: #114183
KEP: kubernetes/enhancements#4381

Special notes for your reviewer:

/area conformance
cc @kubernetes/sig-architecture-pr-reviews @kubernetes/sig-node-pr-reviews @kubernetes/cncf-conformance-wg

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. area/conformance Issues or PRs related to kubernetes conformance tests sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jul 22, 2025
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 22, 2025
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 22, 2025
@k8s-ci-robot (Contributor) commented:

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. labels Jul 22, 2025
@k8s-ci-robot k8s-ci-robot requested review from bart0sh and klueska July 22, 2025 20:07
@@ -921,74 +1009,6 @@ var _ = framework.SIGDescribe("node")(framework.WithLabel("DRA"), framework.With
}).WithTimeout(f.Timeouts.PodDelete).Should(gomega.HaveField("Status.Allocation", (*resourceapi.AllocationResult)(nil)))
})

f.It("must be possible for the driver to update the ResourceClaim.Status.Devices once allocated", f.WithFeatureGate(features.DRAResourceClaimDeviceStatus), func(ctx context.Context) {
@pohly (Contributor, author) commented on this diff:

This shouldn't have been in singleNodeTests because then it runs with and without kubelet. It gets moved to a better location such that it only runs once overall. I can also merge this separately.

@BenTheElder (Member) commented:

All of these tests already ran before for a while (possibly under slightly different names or labels) in:
https://testgrid.k8s.io/sig-release-master-informing#kind-master-beta&include-filter-by-regex=DRA

Can we clarify that? It's usually on the PR author to provide the evidence of mapping the proposed tests to stable test results.

Some of the tests in that link, which are tagged "ConformanceCandidate", are being skipped and not passing because they require 3+ schedulable nodes (which would also be new for a conformance test; currently we require 2).

@BenTheElder (Member) commented Jul 22, 2025

Conformance also doesn't currently require that the control plane even exists as a node in any form, so it seems like noise to tag most of these tests with "control plane with single node" in the name. Or, if these tests do depend in some way on there being only a single control plane node, we should rewrite them to not depend on that sort of detail.

Clusters should be equally conformant with HA control planes and single-instance control planes; the exact details of how the apiserver is hosted are not important, just the behaviors of the API.

You could host the apiserver in a container in another cluster, on bare metal, or HA in the cloud; it should all look the same to the conformance tests as long as the API behaviors are respected. The tests should only need to schedule to worker nodes (2+) and use the API endpoint however it is running.

@aojea (Member) commented Jul 22, 2025

I do not feel this has to go to conformance without more soak time of the code in GA; until today these tests were effectively tested only at beta, and only with kind clusters.

@pohly (Contributor, author) commented Jul 23, 2025

Can we clarify that? It's usually on the PR author to provide the evidence of mapping the proposed tests to stable test results.

Clarify how? One link per test that is proposed?

Before I do that, let's get consensus on how many tests should get promoted, if any. We have different options:

  • wait a release (@aojea)
  • all that can be promoted, as long as they can be proven stable (the current content of this PR, but I have no real preference)
  • just enough to hit the new endpoints (ResourceClaim, ResourceClaim/Status, ResourceClaimTemplate, ResourceSlice) - that could be a single test.

Some of the tests in that link, which are tagged "ConformanceCandidate", are being skipped and not passing because they require 3+ schedulable nodes (which would also be new for a conformance test; currently we require 2).

Then those tests can't be conformance. I need to double-check, but there should be a reason why they can't work with fewer nodes.

it seems like noise to tag most of these tests with "control plane with single node" in the name

I can probably remove that and still have unique test names.

@pohly pohly marked this pull request as draft July 23, 2025 07:05
@pohly pohly mentioned this pull request Jul 23, 2025
@johnbelamaric (Member) commented:

  • wait a release (@aojea)

Not an option: all GA, non-optional features must have conformance tests.

  • just enough to hit the new endpoints (ResourceClaim, ResourceClaim/Status, ResourceClaimTemplate, ResourceSlice) - that could be a single test.

Let's go with the minimal set for now; we can always layer in more later. This reduces risk for distros. What about DeviceClass?

@pohly (Contributor, author) commented Jul 24, 2025

Let's go with the minimal set for now; we can always layer in more later. This reduces risk for distros. What about DeviceClass?

The PR has already been updated to promote just one test.

I forgot to mention DeviceClass. It is also covered because all allocations require one.

@pohly (Contributor, author) commented Jul 24, 2025

/test pull-kubernetes-integration

Unrelated flake.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 25, 2025
@pohly pohly force-pushed the dra-e2e-conformance branch from 441daaf to 3b23974 Compare July 28, 2025 20:28
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 28, 2025
@pohly (Contributor, author) commented Jul 28, 2025

/assign @johnbelamaric

@stlaz (Member) commented Aug 5, 2025

The test appears to be very flaky during pull request runs; apparently the pod that's getting created tends to be unschedulable.
@pohly could you have a look?

Failed run links from the past 24 hrs:
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/131260/pull-kubernetes-e2e-gce/1952515411312381952
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/133373/pull-kubernetes-e2e-gce/1952421018924486656
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/132812/pull-kubernetes-e2e-gce/1952320607337058304

@BenTheElder (Member) commented:

Re: this release or next, even for a GA feature.

Our policy states:
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md

it is stable and runs consistently (e.g., no flakes), and has been running for at least two weeks

new conformance tests or updates to conformance tests for additional scenarios are only allowed before code freeze dates set by the release team to allow enough soak time of the changes and gives folks a chance to kick the tires either in the community CI or their own infrastructure to make sure the tests are robust

The reason for requiring the conformance tests before the deadline ordinarily is so we know that they are not flaky.
I think it makes sense to be flexible about the deadline aspect, but not the flakiness aspect.

This test is rather flaky:

https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=supports%20reusing%20resources
[screenshot: k8s-triage results for "supports reusing resources", 2025-08-05]

It's actually the flakiest DRA test in the release-master-blocking GCE default test suite:

https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default&width=20&include-filter-by-regex=DRA&sort-by-flakiness=
[screenshot: testgrid flakiness view for DRA tests in gce-cos-master-default, 2025-08-05]

And in fact the flakiest test in that job of any sort:
https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default&width=20&&sort-by-flakiness=

This job is normally 100% green, excepting regressions. The flakes are these tests, not the CI / infra.

Maybe one of the other test cases is more appropriate?

@yliaog (Contributor) commented Aug 5, 2025

The single test to promote to conformance below does not seem to be flaky.
Kubernetes e2e suite.[It] [sig-node] [DRA] control plane [ConformanceCandidate] supports claim and class parameters

The following test seems to be flaky; it is a ConformanceCandidate, and its flakiness needs to be fixed before promotion.
Kubernetes e2e suite.[It] [sig-node] [DRA] control plane [ConformanceCandidate] supports reusing resources

@yliaog (Contributor) commented Aug 5, 2025

@stlaz @BenTheElder could you please double check if the following test can be promoted to conformance?

Kubernetes e2e suite.[It] [sig-node] [DRA] control plane [ConformanceCandidate] supports claim and class parameters

@BenTheElder (Member) commented:

"supports claim and class" only had one flake on GCE in recent history, it is flaking somewhat on a CAPZ job I don't recognize and am doubtful we should block on.

https://storage.googleapis.com/k8s-triage/index.html?test=supports%20claim%20and%20class

@yliaog (Contributor) commented Aug 5, 2025

I also investigated the cause of the flakiness of the test:
Kubernetes e2e suite.[It] [sig-node] [DRA] control plane [ConformanceCandidate] supports reusing resources

The test creates 10 pods at the same time; all 10 pods reuse the same device, which means each pod must be scheduled and then deleted sequentially to free the device for the next one. This sequential scheduling and deletion takes time. Currently, the wait timeout for all 10 pods to be scheduled is 80 seconds, which in some cases is not sufficient, as shown in the following log.

At 2025-08-05 20:01:43 the first attempt at scheduling a pod started, and at 2025-08-05 20:03:07 the wait timed out. During this period some pods were scheduled and deleted, but not all 10. So increasing the wait timeout, or reducing the number of pods, should fix the test flakiness.

https://prow.k8s.io/13b8c7b4-1951-4693-9393-83b4a3068d7d

I0805 20:03:13.208285 9189 dump.go:53] At 2025-08-05 20:01:43 +0000 UTC - event for tester-10: {resource_claim } FailedResourceClaimCreation: PodResourceClaim my-inline-claim: resource claim template "tester-10": resourceclaimtemplate.resource.k8s.io "tester-10" not found
I0805 20:03:13.208289 9189 dump.go:53] At 2025-08-05 20:01:50 +0000 UTC - event for tester-1: {default-scheduler } FailedScheduling: 0/4 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 1 timed out trying to allocate devices, 2 cannot allocate all claims. still not schedulable, preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
I0805 20:03:13.208292 9189 dump.go:53] At 2025-08-05 20:01:59 +0000 UTC - event for tester-2: {default-scheduler } Scheduled: Successfully assigned dra-1246/tester-2 to bootstrap-e2e-minion-group-rk4q
I0805 20:03:13.208295 9189 dump.go:53] At 2025-08-05 20:01:59 +0000 UTC - event for tester-2: {kubelet bootstrap-e2e-minion-group-rk4q} FailedPrepareDynamicResources: Failed to prepare dynamic resources: prepare dynamic resources: DRA driver dra-1246.k8s.io is not registered
I0805 20:03:13.208297 9189 dump.go:53] At 2025-08-05 20:02:07 +0000 UTC - event for tester-3: {default-scheduler } FailedScheduling: 0/4 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 cannot allocate all claims. still not schedulable, preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
I0805 20:03:13.208300 9189 dump.go:53] At 2025-08-05 20:02:13 +0000 UTC - event for tester-4: {default-scheduler } Scheduled: Successfully assigned dra-1246/tester-4 to bootstrap-e2e-minion-group-rk4q
I0805 20:03:13.208311 9189 dump.go:53] At 2025-08-05 20:02:18 +0000 UTC - event for tester-5: {default-scheduler } FailedScheduling: 0/4 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 cannot allocate all claims. still not schedulable, preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
I0805 20:03:13.208314 9189 dump.go:53] At 2025-08-05 20:02:25 +0000 UTC - event for tester-6: {default-scheduler } Scheduled: Successfully assigned dra-1246/tester-6 to bootstrap-e2e-minion-group-rk4q
I0805 20:03:13.208317 9189 dump.go:53] At 2025-08-05 20:02:26 +0000 UTC - event for tester-6: {kubelet bootstrap-e2e-minion-group-rk4q} FailedPrepareDynamicResources: Failed to prepare dynamic resources: prepare dynamic resources: DRA driver dra-1246.k8s.io is not registered
I0805 20:03:13.208321 9189 dump.go:53] At 2025-08-05 20:02:30 +0000 UTC - event for tester-7: {default-scheduler } FailedScheduling: 0/4 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 cannot allocate all claims. still not schedulable, preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
I0805 20:03:13.208326 9189 dump.go:53] At 2025-08-05 20:02:35 +0000 UTC - event for tester-8: {default-scheduler } Scheduled: Successfully assigned dra-1246/tester-8 to bootstrap-e2e-minion-group-rk4q
I0805 20:03:13.208329 9189 dump.go:53] At 2025-08-05 20:02:36 +0000 UTC - event for tester-8: {kubelet bootstrap-e2e-minion-group-rk4q} FailedPrepareDynamicResources: Failed to prepare dynamic resources: prepare dynamic resources: DRA driver dra-1246.k8s.io is not registered
I0805 20:03:13.208332 9189 dump.go:53] At 2025-08-05 20:02:39 +0000 UTC - event for tester-9: {default-scheduler } FailedScheduling: 0/4 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 cannot allocate all claims. still not schedulable, preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
I0805 20:03:13.208414 9189 dump.go:53] At 2025-08-05 20:02:42 +0000 UTC - event for tester-10: {default-scheduler } FailedScheduling: 0/4 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 3 cannot allocate all claims. still not schedulable, preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
I0805 20:03:13.208421 9189 dump.go:53] At 2025-08-05 20:02:46 +0000 UTC - event for tester-1: {default-scheduler } Scheduled: Successfully assigned dra-1246/tester-1 to bootstrap-e2e-minion-group-rk4q
I0805 20:03:13.208424 9189 dump.go:53] At 2025-08-05 20:02:47 +0000 UTC - event for tester-1: {kubelet bootstrap-e2e-minion-group-rk4q} FailedPrepareDynamicResources: Failed to prepare dynamic resources: prepare dynamic resources: DRA driver dra-1246.k8s.io is not registered
I0805 20:03:13.208427 9189 dump.go:53] At 2025-08-05 20:02:53 +0000 UTC - event for tester-7: {default-scheduler } Scheduled: Successfully assigned dra-1246/tester-7 to bootstrap-e2e-minion-group-rk4q
I0805 20:03:13.208429 9189 dump.go:53] At 2025-08-05 20:02:54 +0000 UTC - event for tester-7: {kubelet bootstrap-e2e-minion-group-rk4q} FailedPrepareDynamicResources: Failed to prepare dynamic resources: prepare dynamic resources: DRA driver dra-1246.k8s.io is not registered
I0805 20:03:13.208432 9189 dump.go:53] At 2025-08-05 20:03:00 +0000 UTC - event for tester-5: {default-scheduler } Scheduled: Successfully assigned dra-1246/tester-5 to bootstrap-e2e-minion-group-rk4q
I0805 20:03:13.208436 9189 dump.go:53] At 2025-08-05 20:03:01 +0000 UTC - event for tester-5: {kubelet bootstrap-e2e-minion-group-rk4q} FailedPrepareDynamicResources: Failed to prepare dynamic resources: prepare dynamic resources: DRA driver dra-1246.k8s.io is not registered
I0805 20:03:13.208439 9189 dump.go:53] At 2025-08-05 20:03:06 +0000 UTC - event for tester-9: {default-scheduler } FailedScheduling: skip schedule deleting pod: dra-1246/tester-9
I0805 20:03:13.208442 9189 dump.go:53] At 2025-08-05 20:03:07 +0000 UTC - event for tester-3: {default-scheduler } FailedScheduling: running Bind plugin "DefaultBinder": pods "tester-3" not found
I0805 20:03:13.208445 9189 dump.go:53] At 2025-08-05 20:03:09 +0000 UTC - event for dra-test-driver: {replicaset-controller } SuccessfulDelete: Deleted pod: dra-test-driver-2m8p9
I0805 20:03:13.208448 9189 dump.go:53] At 2025-08-05 20:03:09 +0000 UTC - event for dra-test-driver-2m8p9: {kubelet bootstrap-e2e-minion-group-rk4q} Killing: Stopping container pause
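A rough back-of-the-envelope check based on the "Scheduled" timestamps above (my estimate, not something the test itself reports): pods were bound at 20:01:59, 20:02:13, 20:02:25, 20:02:35, 20:02:46, 20:02:53 and 20:03:00, i.e. roughly every 7 to 14 seconds, or about 10 seconds per pod once the cycle gets going. Cycling all 10 pods through a single device at that rate needs on the order of 100 seconds or more, which is well over the 80-second wait timeout.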

@johnbelamaric (Member) commented:

The test creates 10 pods at the same time; all 10 pods reuse the same device, which means each pod must be scheduled and then deleted sequentially to free the device for the next one. This sequential scheduling and deletion takes time. Currently, the wait timeout for all 10 pods to be scheduled is 80 seconds, which in some cases is not sufficient, as shown in the following log.

10 seems arbitrary; perhaps we reduce it to 3 or 4, which will reduce the runtime as well.

@yliaog (Contributor) commented Aug 6, 2025

#133397 reduced it to 5.

referencing the ResourceClaimTemplate. kube-scheduler must allocate a suitable
device from a ResourceSlice and copy the parameters into the allocation result.
kubelet must invoke a DRA driver such that these parameters are active for the Pod
(not part of conformance).
Member:

This seems contradictory? How do we have a conformance test that self-discloses requirements of behavior that is not part of conformance?

Member:

If this is just documenting some expectation of how DRA should behave that is not tested, we should drop it from the description ...

If this is WRT the actual test behavior, we probably need to pick another test.

The test code is pretty heavily abstracted, it's not immediately apparent to my late night brain if this test depends on installing a driver / kubelet invoking a driver?

Member:

cc @yliaog

Contributor:

The same test is used in two ways (withKubelet false and withKubelet true): singleNodeTests := func(withKubelet bool).

For the conformance test, withKubelet is false, hence the conformance test does not require kubelet; the test checks whether the pods are scheduled successfully, not whether they are running.
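For readers skimming without the source handy, here is a minimal sketch of that pattern in plain Go, using the standard testing package rather than the actual Ginkgo-based e2e framework; singleNodeTests and withKubelet are the names from the PR, everything else is illustrative:

    package dra_test

    import "testing"

    // singleNodeTests mirrors the structure described above: one test body
    // registered twice, with the kubelet-dependent checks gated by withKubelet.
    func singleNodeTests(t *testing.T, withKubelet bool) {
        // Control-plane behavior, always checked: the ResourceClaim gets
        // allocated and the pod referencing it gets scheduled.
        t.Log("check: ResourceClaim allocated, pod scheduled")

        if !withKubelet {
            // Conformance variant stops here: no kubelet or DRA driver is
            // required, and whether the pod actually runs is never checked.
            return
        }

        // Full e2e variant only: the driver prepares the device and the pod
        // reaches Running.
        t.Log("check: device prepared by driver, pod running")
    }

    // Two registrations of the same body, matching the conformance and
    // non-conformance variants described in this thread.
    func TestControlPlaneOnly(t *testing.T) { singleNodeTests(t, false) }
    func TestWithKubelet(t *testing.T)      { singleNodeTests(t, true) }

This gating bool is also what TODO 2 in the review further down is about: the conformance variant's coverage is only visible by tracing which branches run when withKubelet is false.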

Member:

ACK.

Member:

I think the TODO is to drop this from the conformance test description to avoid confusion, but it's non-blocking.

@BenTheElder (Member) left a comment:

cc @dims re: alternative test suggestion^

@@ -1972,7 +2103,9 @@ var _ = framework.SIGDescribe("node")(framework.WithLabel("DRA"), func() {
b.TestPod(ctx, f, pod)
})

f.It("supports count/resourceclaims.resource.k8s.io ResourceQuota", framework.WithLabel("ConformanceCandidate") /* TODO: replace with framework.WithConformance() */, func(ctx context.Context) {
// This could be required for conformance, but there have been some rare known flakes in the quota controller, so it's not considered
Member:

Too bad :/

Do we have a tracking bug?
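For context, the quota under test is a plain object-count quota on ResourceClaims. A minimal sketch of such an object follows (the name, namespace, and limit of 2 are made up for illustration; only the count/resourceclaims.resource.k8s.io key comes from the test name):

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
        // Object-count quota: allow at most two ResourceClaims in the namespace.
        quota := corev1.ResourceQuota{
            ObjectMeta: metav1.ObjectMeta{Name: "resourceclaim-quota", Namespace: "dra-example"},
            Spec: corev1.ResourceQuotaSpec{
                Hard: corev1.ResourceList{
                    "count/resourceclaims.resource.k8s.io": resource.MustParse("2"),
                },
            },
        }
        fmt.Printf("hard limits: %v\n", quota.Spec.Hard)
    }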


@yliaog (Contributor) commented Aug 6, 2025

@BenTheElder this PR picks the following test to promote; the picked test is not flaky.
Kubernetes e2e suite.[It] [sig-node] [DRA] control plane [ConformanceCandidate] supports claim and class parameters

The flaky test below is not promoted in this PR.
Kubernetes e2e suite.[It] [sig-node] [DRA] control plane [ConformanceCandidate] supports reusing resources

@BenTheElder (Member) commented:

/retest
These are mostly failures of the image pulling test that was broken.

@yliaog (Contributor) commented Aug 6, 2025

/retest

@yliaog (Contributor) commented Aug 6, 2025

/retest

@yliaog (Contributor) commented Aug 6, 2025

/retest

@BenTheElder (Member) commented:

All cluster E2E is currently broken on master / 1.34 because apiserver panics on startup.

#133405

@yliaog (Contributor) commented Aug 7, 2025

/retest

@yliaog (Contributor) commented Aug 7, 2025

@BenTheElder @dims the PR has passed all tests and is ready for review/approval.

@BenTheElder (Member) left a comment:

IMHO: I have two TODOs that I think can be non-blocking but would be good to resolve:

  1. Clean up the test descriptions so that the conformance tests do not mention details that only happen in the non-conformance variant of the same test. Just don't mention it; the mention is confusing.

  2. Rewrite to avoid the local "framework" that duplicates the tests into conformance-or-not and "multi control plane"-or-not variants; it makes it harder to see what a test is doing without local context and doesn't save much code.

In 1.35, once the flakes have been resolved, we can look at also promoting at least the other single test previously proposed.

WDYT @dims?
