DRA E2E: promote one test to conformance #133132
base: master
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
test/e2e/dra/dra.go
@@ -921,74 +1009,6 @@ var _ = framework.SIGDescribe("node")(framework.WithLabel("DRA"), framework.With
}).WithTimeout(f.Timeouts.PodDelete).Should(gomega.HaveField("Status.Allocation", (*resourceapi.AllocationResult)(nil)))
})

f.It("must be possible for the driver to update the ResourceClaim.Status.Devices once allocated", f.WithFeatureGate(features.DRAResourceClaimDeviceStatus), func(ctx context.Context) {
This shouldn't have been in singleNodeTests, because then it runs with and without kubelet. It gets moved to a better location such that it only runs once overall. I can also merge this separately.
Can we clarify that? It's usually on the PR author to provide the evidence mapping the proposed tests to stable test results. Some of the tests in that link, which are tagged "ConformanceCandidate", are being skipped and not passing because they require 3+ schedulable nodes (which would also be new for a conformance test; currently we require 2).
Conformance also doesn't currently require that the control plane even exist as a node in any form, so it seems like noise to tag most of these tests with "control plane with single node" in the name. Alternatively, if these tests depend in some way on there being only a single control plane node, we should rewrite them to not depend on that sort of detail. Clusters should be equally conformant with HA control planes and single-instance control planes; the exact details of how the apiserver is hosted are not important, just the behaviors of the API. You could host the apiserver in a container in another cluster, on bare metal, or HA in the cloud, and it should be the same to the conformance tests as long as the API behaviors are respected. The tests should only need to schedule to worker nodes (2+) and use the API endpoint however it is running.
I do not feel this has to go to conformance without more soak time of the code in GA; these tests were effectively exercised only at beta until today, and only with kind clusters.
Clarify how? One link per test that is proposed? Before I do that, let's get consensus on how many tests should get promoted, if any. We have different options:
Then those tests can't be conformance. I need to double-check, but there should be a reason why they can't work with fewer nodes.
I can probably remove that and still have unique test names.
Not an option: all GA, non-optional features must have conformance tests.
Let's go with the minimal set for now; we can always layer in more later. This reduces risk for distros. What about DeviceClass?
The PR has already been updated to promote just one test. I forgot to mention DeviceClass. It is also covered because all allocations require one.
/test pull-kubernetes-integration
Unrelated flake.
All of these tests already ran before for a while in
https://testgrid.k8s.io/sig-release-master-informing#kind-master-beta&include-filter-by-regex=DRA

Now that DRA is GA, they also ran e.g. in
https://testgrid.k8s.io/sig-release-master-blocking#kind-master&include-filter-by-regex=DRA&include-filter-by-regex=DRA
https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default&include-filter-by-regex=DRA

For the sake of starting conservatively, only a single basic test which exercises ResourceSlice, ResourceClaimTemplate, ResourceClaim and ResourceClaim/Status gets promoted to a conformance test:

dra/dra.go:747: [sig-node] [DRA] control plane supports claim and class parameters [Conformance]

Documentation for most of the others gets added for future promotion, but they remain labeled as ConformanceCandidate:

dra/dra.go:1993: [sig-node] [DRA] ResourceSlice Controller creates slices [ConformanceCandidate]
dra/dra.go:2285: [sig-node] [DRA] control plane must apply per-node permission checks [ConformanceCandidate]
dra/dra.go:1013: [sig-node] [DRA] control plane must deallocate after use [ConformanceCandidate]
dra/dra.go:825: [sig-node] [DRA] control plane retries pod scheduling after creating device class [ConformanceCandidate]
dra/dra.go:851: [sig-node] [DRA] control plane retries pod scheduling after updating device class [ConformanceCandidate]
dra/dra.go:889: [sig-node] [DRA] control plane runs a pod without a generated resource claim [ConformanceCandidate]
dra/dra.go:2149: [sig-node] [DRA] control plane supports count/resourceclaims.resource.k8s.io ResourceQuota [ConformanceCandidate]
dra/dra.go:977: [sig-node] [DRA] control plane supports external claim referenced by multiple containers of multiple pods [ConformanceCandidate]
dra/dra.go:957: [sig-node] [DRA] control plane supports external claim referenced by multiple pods [ConformanceCandidate]
dra/dra.go:997: [sig-node] [DRA] control plane supports init containers [ConformanceCandidate]
dra/dra.go:929: [sig-node] [DRA] control plane supports inline claim referenced by multiple containers [ConformanceCandidate]
dra/dra.go:759: [sig-node] [DRA] control plane supports reusing resources [ConformanceCandidate]
dra/dra.go:793: [sig-node] [DRA] control plane supports sharing a claim concurrently [ConformanceCandidate]
dra/dra.go:942: [sig-node] [DRA] control plane supports simple pod referencing external resource claim [ConformanceCandidate]
dra/dra.go:915: [sig-node] [DRA] control plane supports simple pod referencing inline resource claim [ConformanceCandidate]
dra/dra.go:2137: [sig-node] [DRA] control plane truncates the name of a generated resource claim [ConformanceCandidate]
dra/dra.go:1084: [sig-node] [DRA] control plane with different ResourceSlices keeps pod pending because of CEL runtime errors [ConformanceCandidate]
dra/dra.go:1149: [sig-node] [DRA] control plane with node-local resources uses all resources [ConformanceCandidate]
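For context, the pod-side flow that the promoted test exercises looks roughly like the sketch below. This is not copied from dra.go; field names follow core/v1 in recent releases, and all object names are illustrative. The pod references a ResourceClaimTemplate, the control plane generates a ResourceClaim from it, and kube-scheduler has to allocate a device for that claim before the pod can be scheduled.

package dratest

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// testerPod returns a pod that references a ResourceClaimTemplate. The control
// plane generates a ResourceClaim from the template, and kube-scheduler must
// allocate a suitable device for that claim before the pod can be scheduled.
// All names here are illustrative, not the ones used by the e2e suite.
func testerPod(namespace string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "tester", Namespace: namespace},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "ctr",
				Image: "registry.k8s.io/pause:3.10",
				Resources: corev1.ResourceRequirements{
					// The container opts in to the claim declared below.
					Claims: []corev1.ResourceClaim{{Name: "my-inline-claim"}},
				},
			}},
			ResourceClaims: []corev1.PodResourceClaim{{
				Name: "my-inline-claim",
				// Points at a ResourceClaimTemplate in the same namespace.
				ResourceClaimTemplateName: ptr.To("tester-template"),
			}},
		},
	}
}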
/assign @johnbelamaric
The test appears to be very flaky during pull request runs; apparently the pod that gets created tends to be unschedulable. Failed run links from the past 24 hrs:
Re: this release or next, even for a GA feature. Our policy states:
The reason for ordinarily requiring the conformance tests before the deadline is so we know that they are not flaky. This test is rather flaky: https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=supports%20reusing%20resources

It's actually the flakiest DRA test in the release-master-blocking GCE default test suite: https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default&width=20&include-filter-by-regex=DRA&sort-by-flakiness=

And in fact the flakiest test of any sort in that job. This job is normally 100% green, excepting regressions. The flakes are these tests, not the CI / infra.

Maybe one of the other test cases is more appropriate?
The single test to be promoted to conformance (below) does not seem to be flaky. The following test does seem to be flaky; it is a ConformanceCandidate, and its flakiness needs to be fixed before promotion.
@stlaz @BenTheElder could you please double-check whether the following test can be promoted to conformance?
Kubernetes e2e suite.[It] [sig-node] [DRA] control plane [ConformanceCandidate] supports claim and class parameters
"supports claim and class" only had one flake on GCE in recent history, it is flaking somewhat on a CAPZ job I don't recognize and am doubtful we should block on. https://storage.googleapis.com/k8s-triage/index.html?test=supports%20claim%20and%20class |
I also investigated the cause of the test's flakiness: the test creates 10 pods at the same time, and all 10 pods reuse the same device, which means each pod needs to be scheduled and then deleted, sequentially, to be able to reuse that device. The sequential scheduling and deletion of pods takes time. Currently, the wait timeout for all 10 pods to be scheduled is 80 seconds, which in some cases is not sufficient, as shown in the following log: at 2025-08-05 20:01:43, the first attempt at scheduling a pod started; at 2025-08-05 20:03:07, the wait timed out. During this period, some pods had been scheduled and deleted, but not all 10. So increasing the wait timeout, or reducing the number of pods, should fix the test flakiness.

https://prow.k8s.io/13b8c7b4-1951-4693-9393-83b4a3068d7d

I0805 20:03:13.208285 9189 dump.go:53] At 2025-08-05 20:01:43 +0000 UTC - event for tester-10: {resource_claim } FailedResourceClaimCreation: PodResourceClaim my-inline-claim: resource claim template "tester-10": resourceclaimtemplate.resource.k8s.io "tester-10" not found
10 seems arbitrary; perhaps we reduce it to 3 or 4, which will reduce the runtime as well.
#133397 reduced it to 5.
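To make the timing issue concrete, here is a rough sketch of the shared-device pattern under discussion and why the wait budget should scale with the number of pods rather than being a fixed 80 seconds. This is not the actual dra.go code; the helper functions are passed in as parameters precisely because their real implementations live in the e2e framework, and the names here are made up.

package dratest

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// runReuseScenario sketches the "supports reusing resources" pattern: all pods
// reference the same single device, so at most one of them can hold the
// allocation at a time and the others stay pending until it is deleted.
func runReuseScenario(
	ctx context.Context,
	numPods int, // reduced from 10 to 5 by #133397
	perPodBudget time.Duration, // one schedule -> delete -> deallocate cycle
	createPodsSharingOneDevice func(context.Context, int) []*corev1.Pod,
	waitForAnyScheduled func(context.Context, []*corev1.Pod) (*corev1.Pod, error),
	deleteAndWaitForDeallocation func(context.Context, *corev1.Pod) error,
) error {
	// All pods are created up front; only one can be scheduled at a time.
	pods := createPodsSharingOneDevice(ctx, numPods)

	// The overall timeout must cover numPods sequential cycles. A fixed 80s
	// budget was occasionally too short with 10 pods.
	ctx, cancel := context.WithTimeout(ctx, time.Duration(numPods)*perPodBudget)
	defer cancel()

	for len(pods) > 0 {
		// Whichever pending pod gets the device is scheduled next.
		pod, err := waitForAnyScheduled(ctx, pods)
		if err != nil {
			return err
		}
		// Deleting it frees the device for the next pod.
		if err := deleteAndWaitForDeallocation(ctx, pod); err != nil {
			return err
		}
		// Drop the finished pod from the remaining set.
		remaining := pods[:0]
		for _, p := range pods {
			if p != pod {
				remaining = append(remaining, p)
			}
		}
		pods = remaining
	}
	return nil
}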
referencing the ResourceClaimTemplate. kube-scheduler must allocate a suitable
device from a ResourceSlice and copy the parameters into the allocation result.
kubelet must invoke a DRA driver such that these parameters are active for the Pod
(not part of conformance).
This seems contradictory? How do we have a conformance test that self-discloses requirements of behavior that is not part of conformance?
If this is just documenting some expectation of how DRA should behave that is not tested, we should drop it from the description ...
If this is WRT the actual test behavior, we probably need to pick another test.
The test code is pretty heavily abstracted, it's not immediately apparent to my late night brain if this test depends on installing a driver / kubelet invoking a driver?
cc @yliaog
The same test is used in two ways (withKubelet false and true): singleNodeTests := func(withKubelet bool).
For the conformance test, withKubelet is false, hence the conformance test does not require kubelet; the test checks whether the pods are scheduled successfully, not whether they are running.
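In code terms, the split could be pictured roughly like this. It is a hedged sketch of the pattern, not the actual dra.go code; the helper variables are stand-ins for the e2e framework's builders, and only the withKubelet parameter name is taken from the real code.

package dratest

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

// Stand-ins for helpers provided by the DRA e2e framework; the names here are
// illustrative only.
var (
	createPodWithGeneratedClaim func(ctx context.Context) *corev1.Pod
	waitForPodScheduled         func(ctx context.Context, pod *corev1.Pod)
	waitForPodRunning           func(ctx context.Context, pod *corev1.Pod)
)

// registerSingleNodeTests mirrors the singleNodeTests(withKubelet bool)
// pattern: the same body is registered for both variants, but only the
// kubelet-backed variant asserts that the pod actually runs. The conformance
// variant (withKubelet == false) stops once the pod has been scheduled.
func registerSingleNodeTests(withKubelet bool, it func(name string, body func(context.Context))) {
	it("supports claim and class parameters", func(ctx context.Context) {
		pod := createPodWithGeneratedClaim(ctx)
		waitForPodScheduled(ctx, pod) // conformance only needs scheduling to succeed
		if withKubelet {
			// Requires a DRA driver and kubelet; deliberately not part of conformance.
			waitForPodRunning(ctx, pod)
		}
	})
}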
ACK.
I think the TODO is to drop this from the conformance test description to avoid confusion, but it's non-blocking.
cc @dims re: alternative test suggestion^
@@ -1972,7 +2103,9 @@ var _ = framework.SIGDescribe("node")(framework.WithLabel("DRA"), func() {
b.TestPod(ctx, f, pod)
})

f.It("supports count/resourceclaims.resource.k8s.io ResourceQuota", framework.WithLabel("ConformanceCandidate") /* TODO: replace with framework.WithConformance() */, func(ctx context.Context) {
// This could be required for conformance, but there have been some rare known flakes in the quota controller, so it's not considered
Too bad :/
Do we have a tracking bug?
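For reference, the quota that this candidate test exercises is a standard object-count ResourceQuota on resourceclaims.resource.k8s.io. A minimal client-go sketch could look like the following; the function name, the quota object name, and the limit of 1 are assumptions for illustration, not taken from dra.go.

package dratest

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createResourceClaimQuota creates an object-count quota that caps the number
// of ResourceClaims in the namespace, which is the behavior the
// "supports count/resourceclaims.resource.k8s.io ResourceQuota" candidate
// exercises. The limit of 1 is just an example.
func createResourceClaimQuota(ctx context.Context, cs kubernetes.Interface, namespace string) (*corev1.ResourceQuota, error) {
	quota := &corev1.ResourceQuota{
		ObjectMeta: metav1.ObjectMeta{Name: "resourceclaim-quota", Namespace: namespace},
		Spec: corev1.ResourceQuotaSpec{
			Hard: corev1.ResourceList{
				// Object-count quota syntax: count/<resource>.<group>.
				"count/resourceclaims.resource.k8s.io": resource.MustParse("1"),
			},
		},
	}
	return cs.CoreV1().ResourceQuotas(namespace).Create(ctx, quota, metav1.CreateOptions{})
}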
@BenTheElder this PR picks the following test to promote; the picked test is not flaky. The flaky test below is not promoted in this PR.
/retest |
/retest |
/retest |
/retest |
All cluster E2E is currently broken on master / 1.34 because apiserver panics on startup. |
/retest |
@BenTheElder @dims the PR has passed all tests and is ready for review/approval.
IMHO: I have two TODOs that I think can be non-blocking but would be good to resolve:
- Clean up the test descriptions so that the conformance tests don't mention details that only happen in the other, non-conformance variant of the same test. Just don't mention it; the mention is confusing.
- Rewrite to avoid the local "framework" for duplicating the tests into conformance-or-not and "multi control plane"-or-not variants; it makes it harder to see what a test is doing without local context and doesn't save much code.
In 1.35, when the flakes have been resolved, we can look at also promoting at least the other single test previously proposed.
WDYT @dims?
What type of PR is this?
/kind feature
What this PR does / why we need it:
All of these tests already ran before for a while in
https://testgrid.k8s.io/sig-release-master-informing#kind-master-beta&include-filter-by-regex=DRA
Now that DRA is GA, they also ran e.g. in
https://testgrid.k8s.io/sig-release-master-blocking#kind-master&include-filter-by-regex=DRA&include-filter-by-regex=DRA
https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-default&include-filter-by-regex=DRA
For the sake of starting conservatively, only a single basic test which exercises ResourceSlice, DeviceClass, ResourceClaimTemplate, ResourceClaim and ResourceClaim/Status gets promoted to a conformance test:
Documentation for most of the others gets added for future promotion, but they remain labeled as ConformanceCandidate:
Which issue(s) this PR is related to:
Fixes: #114183
KEP: kubernetes/enhancements#4381
Special notes for your reviewer:
/area conformance
cc @kubernetes/sig-architecture-pr-reviews @kubernetes/sig-node-pr-reviews @kubernetes/cncf-conformance-wg
Does this PR introduce a user-facing change?