-
Notifications
You must be signed in to change notification settings - Fork 41.1k
test: add retry to getMetricsFromNode and deflake 'should grab all metrics from kubelet /metrics/resource endpoint' #133392
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: rphillips The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/triage accepted |
Resource("nodes"). | ||
SubResource("proxy"). | ||
Name(fmt.Sprintf("%v:%v", nodeName, kubeletPort)). | ||
Suffix(pathSuffix). | ||
Timeout(45 * time.Second). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this a confusing.
So we are calling this function every 15 s (up to 120 s). And we set a timeout for 45 seconds?
There seems to be overlap where the function will be polling while we are still requesting the call?
Should we drop timeout or match the polling intervanl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are calling this function once if it succeeds. Setting a client side timeout of 45 seconds in case the client request has a timeout. If there is a timeout then the function is run again. If there is not a successful response within 2 minutes then that is fatal.
/lgtm This is not flaky in upstream so I don't think it needs to merge post code freeze. Can wait for 1.35. |
LGTM label has been added. Git tree hash: 133f5e5cef4aad7db266c9b204c64a01c0524be5
|
Assigning myself for sig-instrumentation review. /assign |
What type of PR is this?
/kind bug
/kind failing-test
Test: should grab all metrics from kubelet /metrics/resource endpoint
What this PR does / why we need it:
It's unclear why #114997 was seeing a proxy timeout. 114997 added logic to timeout the connection and error out on the timeout. In OpenShift CI we do see something similar. The case is rare - only about 3% of the time.
Instead of erroring completely we can improve the design of the getMetricsFromNode function to issue a retry, and setup the client to have a 45 second timeout. This will allow the function to gracefully recover and always return metrics - effectively deflaking the test.
45 seconds was selected for the timeout since the metrics endpoint in the Kubelet is sometimes returning metrics within 30 seconds. 30 seconds is a long time and is a separate issue under investigation.
Which issue(s) this PR is related to:
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: