WIP: check invariant metrics after e2e tests #133394

BenTheElder · 2025-08-06T00:43:38Z

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2025-08-06T00:43:41Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2025-08-06T00:43:48Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

k8s-ci-robot · 2025-08-06T00:43:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

~~test/OWNERS~~ [BenTheElder]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

aojea · 2025-08-06T04:31:30Z

test/e2e/e2e_invariants.go

+	for _, sample := range samples {
+		// TODO: this logic is intentionally inverted to showcase the results of a failure
+		// It should be > 0 instead
+		if sample.Value <= 0 {


This assumes the invariant metric value should always be 0, which I think is correct, but I wanted to check whether this assumption covers all use cases.

YAGNI?

The two we have are both "should remain 0". If a new one is added that requires a different check, we can revisit with a refactor.

I want to keep the implementation as simple as possible to avoid flakes.

(This line is also currently a deliberate dummy for testing purposes, per the comment; I will follow up on this particular line.)
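For illustration, here is a minimal Go sketch of the non-inverted check that the TODO describes (fail when a value is `> 0`), assuming every invariant is a counter that "should remain 0". The `sample` type and the metric name are hypothetical stand-ins, not the PR's actual types or metrics.

```go
package main

import "fmt"

// sample is a hypothetical stand-in for one scraped Prometheus sample; the
// real check iterates over samples returned by the e2e metrics grabber.
type sample struct {
	Labels map[string]string
	Value  float64
}

// checkZeroInvariant returns an error for the first sample whose value is
// greater than zero, i.e. when a "should remain 0" invariant was violated.
// This is the non-inverted form of the `sample.Value <= 0` placeholder.
func checkZeroInvariant(name string, samples []sample) error {
	for _, s := range samples {
		if s.Value > 0 {
			return fmt.Errorf("invariant %s violated for labels %v: got %v, want 0",
				name, s.Labels, s.Value)
		}
	}
	return nil
}

func main() {
	// Hypothetical metric name; the thread does not name the two invariants.
	const name = "example_invariant_total"
	clean := []sample{{Labels: map[string]string{"resource": "pods"}, Value: 0}}
	dirty := []sample{{Labels: map[string]string{"resource": "pods"}, Value: 2}}
	fmt.Println(checkZeroInvariant(name, clean)) // <nil>
	fmt.Println(checkZeroInvariant(name, dirty))
}
```

Keeping the check to a single comparison matches the stated goal of keeping the implementation simple to avoid flakes; a metric needing a different condition would warrant its own check.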

aojea · 2025-08-06T04:32:17Z

test/e2e/e2e_invariants.go

+	}
+	apiserverMetrics, err := grabber.GrabFromAPIServer(ctx)
+	if err != nil {
+		framework.Failf("error grabbing api-server metrics: %v", err)


Should we fail if we cannot get the invariant metrics, or just log?

We already have default tests that fetch metrics with the grabber, including from the apiserver specifically, and they seem to be reliable; it is a very simple API call.

I think it is better to fail and detect this unless it proves to be unreliable, in which case we can either switch to logging or mark the test as flaky until it is improved.

But I don't see any reason for this to fail other than an unhealthy API server.

If you mean should we fail if the specific metrics are missing in the response, I think so as well, that would imply a bug.

Added an explicit check that distinguishes whether we were able to read the metric being tested from the case where the samples are empty.

BenTheElder · 2025-08-06T08:35:02Z

So this intentionally has the wrong logic for the metric value currently to ensure that it will fail and surface the results in CI:

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/133394/pull-kubernetes-e2e-kind/1953004021803388928

We can also see that it does NOT fail in the jobs specifically selecting conformance, because the "test" that enables checking the metrics at the end is skipped and checking the metrics is therefore skipped:

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/133394/pull-kubernetes-conformance-kind-ga-only-parallel/1953004022029881344

Once all the tests have completed, the right jobs have passed and failed as expected, and we're otherwise happy with the shape, I'll correct the actual metric check line (see the TODO here: #133394 (comment)).

BenTheElder · 2025-08-06T08:38:26Z

[Invoking additional optional jobs for clarity]

/test pull-kubernetes-conformance-kind-ga-only
/test pull-kubernetes-e2e-kind-alpha-features
/test pull-kubernetes-e2e-kind-beta-features
/test pull-kubernetes-e2e-kind-alpha-beta-features

aojea · 2025-08-06T13:58:38Z

test/e2e/OWNERS

+  # special checks that run after the entire suite
+  # must be specially reviewed
+  "e2e_invariants.go":
+    approvers:


Does this guarantee that top-level approvers are not able to approve?

No, but this config guarantees that auto-assignment will go to this group, and there are comments throughout saying that you must ask them.

If we want to enforce that, we can refactor a bit to put the key details in a subdirectory and use `no_parent_owners`.

We already have `no_parent_owners` at the test/ level, but there are a lot of people listed there.
Anyone listed there could already approve arbitrary changes to the test binary that add something like this, so that level of paranoia seemed unnecessary.

aojea · 2025-08-06T13:59:35Z

test/e2e/e2e_invariants.go

+	// this allows us to run it by default in most jobs, but it can be opted-out,
+	// does not run when selecting Conformance, and it can be tagged Flaky
+	// if we encounter issues with it
+	ginkgo.It(invariantsLeafText, func() {})


Should we also use a label for filtering?

We can add it as needed, but I didn't see a need yet. Did you have any in mind?

My goal is for it to run by default in most jobs, but not in conformance jobs (see which jobs just passed and failed while the check is currently rigged to fail). It can be labeled flaky if we find problems, and any job can also filter it out, at minimum with SKIP=invariants.
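To make the selection behaviour concrete, here is a small sketch that mimics regex-based focus/skip filtering over the full spec text. It assumes a conformance job focuses on `\[Conformance\]` and that SKIP ends up as a skip regex; the spec text is hypothetical, since the actual value of `invariantsLeafText` is not shown in the thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical full spec text for the placeholder spec. Note it carries no
// [Conformance] tag, so conformance-focused runs never select it.
const invariantsSpecText = "[sig-testing] check invariants after the e2e suite"

// selected mimics regex-based spec selection: a spec runs if it matches the
// focus regex (when set) and does not match the skip regex (when set).
func selected(specText, focus, skip string) bool {
	if focus != "" && !regexp.MustCompile(focus).MatchString(specText) {
		return false
	}
	if skip != "" && regexp.MustCompile(skip).MatchString(specText) {
		return false
	}
	return true
}

func main() {
	fmt.Println(selected(invariantsSpecText, "", ""))                // true: default job
	fmt.Println(selected(invariantsSpecText, `\[Conformance\]`, "")) // false: conformance-only job
	fmt.Println(selected(invariantsSpecText, "", "invariants"))      // false: opted out via SKIP
}
```

Because the after-suite check is tied to this spec being selected, any job that filters the spec out also skips the metrics check without extra configuration.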

BenTheElder · 2025-08-06T14:41:33Z

I will debug those verify checks. Otherwise, we can see that the conformance jobs didn't run the actual metrics grabbing and the other jobs did.

We also see no flakes so far, just the intentional failure at the end from the temporarily rigged, inverted metric value assertion.

BenTheElder · 2025-08-06T19:43:28Z

The verify failures required adjusting the code to detect dry-run mode and skip the invariant metrics check in that case.

That's fixed now.
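Putting the gating rules from this thread together (no check in dry-run mode, and no check unless the placeholder spec ran), a sketch of the decision might look like the following. It assumes, hypothetically, that the after-suite hook can see which specs ran, modeled here as a slice of `specReport` values; this is not Ginkgo's actual report type, and the leaf text is made up.

```go
package main

import "fmt"

// specReport is a hypothetical, trimmed-down record of one spec's outcome.
type specReport struct {
	LeafText string
	Skipped  bool
}

// Hypothetical leaf text for the placeholder spec.
const invariantsLeafText = "check invariants"

// shouldCheckInvariants reports whether the post-suite metrics check should
// run: never in dry-run mode (specs are not actually executed), and otherwise
// only when the placeholder spec was selected and not skipped.
func shouldCheckInvariants(reports []specReport, dryRun bool) bool {
	if dryRun {
		return false
	}
	for _, r := range reports {
		if r.LeafText == invariantsLeafText && !r.Skipped {
			return true
		}
	}
	return false
}

func main() {
	ran := []specReport{{LeafText: "some other test"}, {LeafText: invariantsLeafText}}
	skipped := []specReport{{LeafText: invariantsLeafText, Skipped: true}}
	fmt.Println(shouldCheckInvariants(ran, false))     // true
	fmt.Println(shouldCheckInvariants(skipped, false)) // false
	fmt.Println(shouldCheckInvariants(ran, true))      // false
}
```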

k8s-ci-robot · 2025-08-06T20:30:17Z

@BenTheElder: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-kubernetes-e2e-kind-alpha-features | `65ee329` | link | false | `/test pull-kubernetes-e2e-kind-alpha-features` |
| pull-kubernetes-e2e-kind-alpha-beta-features | `65ee329` | link | false | `/test pull-kubernetes-e2e-kind-alpha-beta-features` |
| pull-kubernetes-e2e-kind-beta-features | `65ee329` | link | false | `/test pull-kubernetes-e2e-kind-beta-features` |
| pull-kubernetes-e2e-kind | `56385de` | link | true | `/test pull-kubernetes-e2e-kind` |
| pull-kubernetes-e2e-gce | `56385de` | link | true | `/test pull-kubernetes-e2e-gce` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

k8s-ci-robot added the needs-priority label (indicates a PR lacks a `priority/foo` label and requires one) Aug 6, 2025

k8s-ci-robot requested review from dchen1107 and thockin August 6, 2025 00:44

aojea reviewed Aug 6, 2025

BenTheElder force-pushed the invariants branch 3 times, most recently from b2874c0 to 65ee329, August 6, 2025 08:03

aojea reviewed Aug 6, 2025

This was referenced Aug 6, 2025:

- Invariant Signal Collection for Kubernetes Testing kubernetes/enhancements#5196 (Closed)
- feat: increment an internal metric when duplicate validation errors are found #132613 (Open)

Commit 56385de: WIP: check invariant metrics after e2e tests

BenTheElder force-pushed the invariants branch from 65ee329 to 56385de, August 6, 2025 19:42

BenTheElder mentioned this pull request Aug 6, 2025:

- Invariant Testing kubernetes/enhancements#5468 (Open, 4 tasks)