test: add retry to getMetricsFromNode and deflake 'should grab all metrics from kubelet /metrics/resource endpoint' #133392

rphillips · 2025-08-05T22:45:12Z

What type of PR is this?

/kind bug
/kind failing-test

Test: should grab all metrics from kubelet /metrics/resource endpoint

What this PR does / why we need it:

It's unclear why #114997 was seeing a proxy timeout. #114997 added logic to time out the connection and return an error when the timeout fires. We see something similar in OpenShift CI. The failure is rare, occurring only about 3% of the time.

Instead of failing outright, we can improve the design of the getMetricsFromNode function so that it retries, and set up the client with a 45-second timeout. This lets the function recover gracefully from a transient timeout and still return metrics, effectively deflaking the test.

45 seconds was chosen for the timeout because the metrics endpoint in the Kubelet sometimes takes up to 30 seconds to return metrics. 30 seconds is a long time, and that is a separate issue under investigation.

Which issue(s) this PR is related to:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2025-08-05T22:45:29Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rphillips
Once this PR has been reviewed and has the lgtm label, please assign rexagod for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

test/e2e/framework/metrics/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rphillips · 2025-08-06T01:16:41Z

/triage accepted
/priority important-longterm

kannon92 · 2025-08-06T14:20:12Z

test/e2e/framework/metrics/metrics_grabber.go

 			Resource("nodes").
 			SubResource("proxy").
 			Name(fmt.Sprintf("%v:%v", nodeName, kubeletPort)).
 			Suffix(pathSuffix).
+			Timeout(45 * time.Second).


I find this confusing.

So we are calling this function every 15 s (up to 120 s), and we set a timeout of 45 seconds?

There seems to be an overlap where the poll would fire again while the previous request is still in flight?

Should we drop the timeout or match it to the polling interval?

rphillips · 2025-08-06

If it succeeds, this function is called only once. The 45-second client-side timeout is there in case the client request hangs. If the request times out, the function is run again. If there is no successful response within 2 minutes, that is fatal.

kannon92 · 2025-08-06T16:39:54Z

/lgtm
/milestone v1.35

This is not flaky in upstream so I don't think it needs to merge post code freeze. Can wait for 1.35.

k8s-ci-robot · 2025-08-06T16:40:02Z

LGTM label has been added.

Git tree hash: 133f5e5cef4aad7db266c9b204c64a01c0524be5

dgrisonnet · 2025-08-07T16:45:32Z

Assigning myself for sig-instrumentation review.

/assign

test: add retry to getMetricsFromNode

9a52407

k8s-ci-robot requested review from pohly and richabanker August 5, 2025 22:45

fix lint

c859b90

kannon92 reviewed Aug 6, 2025


k8s-ci-robot added this to the v1.35 milestone Aug 6, 2025

k8s-ci-robot assigned kannon92 Aug 6, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 6, 2025

openshift-ci-robot mentioned this pull request Aug 6, 2025

UPSTREAM: 133392: test: add retry to getMetricsFromNode openshift/kubernetes#2401

Open

k8s-ci-robot assigned dgrisonnet Aug 7, 2025
