Skip to content

WIP: check invariant metrics after e2e tests #133394

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

BenTheElder
Copy link
Member

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Copy link
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 6, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Aug 6, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Aug 6, 2025
for _, sample := range samples {
// TODO: this logic is intentionally inverted to showcase the results of a failure
// It should be > 0 instead
if sample.Value <= 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will assume invariant means is always 0 that I think is correct, but I just wanted to know if we cover all use cases with this assumption

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

YAGNI?

the two we have are "should remain 0", but if a new one is added that requires a different check we can review a refactor.

I want to keep the implementation as simple as possible to avoid flakes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This is also a dummy currently for testing purposes per the comment, will follow up on this particular line)

}
apiserverMetrics, err := grabber.GrabFromAPIServer(ctx)
if err != nil {
framework.Failf("error grabbing api-server metrics: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we fail if we do not get the invariants or just log ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually already have default tests that require fetching metrics with the grabber including apiserver specifically and they seem to be reliable, it is a very simple API call.

I think it is better to fail and detect this unless it is proven to be unreliable. In which case we can either switch to logging or mark the test as flaky until it is improved.

But I don't see any reason for this to fail other than an unhealthy API server.

If you mean should we fail if the specific metrics are missing in the response, I think so as well, that would imply a bug.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an explicit check for if we were able to read the metric being tested versus the empty case for samples.

@BenTheElder BenTheElder force-pushed the invariants branch 3 times, most recently from b2874c0 to 65ee329 Compare August 6, 2025 08:03
@BenTheElder
Copy link
Member Author

BenTheElder commented Aug 6, 2025

So this intentionally has the wrong logic for the metric value currently to ensure that it will fail and surface the results in CI:

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/133394/pull-kubernetes-e2e-kind/1953004021803388928

Screenshot 2025-08-06 at 1 32 03 AM

We can also see that it does NOT fail in the jobs specifically selecting conformance, because the "test" that enables checking the metrics at the end is skipped and checking the metrics is therefore skipped:

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/133394/pull-kubernetes-conformance-kind-ga-only-parallel/1953004022029881344

When all the tests have completed, and the right jobs have passed and failed as expected, and we're otherwise happy with the shape, I'll fix the actual metric check line to be correct (there is a TODO line here #133394 (comment))

@BenTheElder
Copy link
Member Author

[Invoking additional optional jobs for clarity]

/test pull-kubernetes-conformance-kind-ga-only
/test pull-kubernetes-e2e-kind-alpha-features
/test pull-kubernetes-e2e-kind-beta-features
/test pull-kubernetes-e2e-kind-alpha-beta-features

# special checks that run after the entire suite
# must be specially reviewed
"e2e_invariants.go":
approvers:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this guarantee that top level approver are not able to approve?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but this config guarantees auto assignment will go to this group, and there are comments throughout that you must ask.

if we want to enforce that we can refactor a bit to put the key details in a subdirectory and use no_parent_owners.

We already have no_parent_owners at the test/ level but there are a lot of people listed there.
Anyone there could already approve arbitrary changes to the test binary to add something like this, so that level of paranoia seemed unnecessary.

// this allows us to run it by default in most jobs, but it can be opted-out,
// does not run when selecting Conformance, and it can be tagged Flaky
// if we encounter issues with it
ginkgo.It(invariantsLeafText, func() {})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also use label for filters?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add it as needed, but I didn't see a need yet. Did you have any in mind?

My goal is that it should run by default in most jobs, but not conformance (see which jobs just passed and failed with the check being rigged to fail currently), but it can be labeled flaky if we find problems, and any job can also filter it out at least by SKIP=invariants.

@BenTheElder
Copy link
Member Author

Will debug those verify checks. Otherwise we see that the conformance jobs didn't run the actual metrics grabbing and the other jobs did.

We can also see no flakes so far, just the intentional failure at the end with the temporarily rigged inverted metric value assertion.

@BenTheElder
Copy link
Member Author

The verify failures required adjusting to detect dry-run mode and skip the invariant metrics check in that case.

That's fixed now.

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Aug 6, 2025

@BenTheElder: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-e2e-kind-alpha-features 65ee329 link false /test pull-kubernetes-e2e-kind-alpha-features
pull-kubernetes-e2e-kind-alpha-beta-features 65ee329 link false /test pull-kubernetes-e2e-kind-alpha-beta-features
pull-kubernetes-e2e-kind-beta-features 65ee329 link false /test pull-kubernetes-e2e-kind-beta-features
pull-kubernetes-e2e-kind 56385de link true /test pull-kubernetes-e2e-kind
pull-kubernetes-e2e-gce 56385de link true /test pull-kubernetes-e2e-gce

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants