fix(cache): watch errors must call done handler #781

Conversation

bbatha
Contributor

@bbatha bbatha commented Mar 2, 2022

The type of `watchObject` was incorrect and has been updated to match
the actual request body.

With the corrected type it became clear that `ERROR` events were not being
handled correctly. When the watch receives an error, it does not always
arrive as an HTTP status code, because the status code can only be sent
when the stream is starting. This means that `410` (resourceVersion out of
date) errors could only be handled if they were detected before the watch
stream started. Errors delivered later left watches running on channels
that would never receive more events, and `ListWatch` consumers were not
notified of the error.
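
To illustrate the failure mode described above, here is a minimal sketch of treating an in-stream `ERROR` event as a terminal error and passing it to the done handler. The `WatchEvent` shape, `WatchStreamError`, and `handleWatchEvent` are hypothetical names chosen for this example; they are not the library's actual types or the code changed in this PR.

```typescript
// Hypothetical event shape for illustration; not the library's real type.
interface WatchEvent {
  type: string; // e.g. 'ADDED', 'MODIFIED', 'DELETED', 'ERROR'
  object: { code?: number; message?: string; [key: string]: unknown };
}

// Hypothetical error type that carries the status code found in the event body.
class WatchStreamError extends Error {
  statusCode: number | undefined;

  constructor(message: string, statusCode?: number) {
    super(message);
    this.statusCode = statusCode;
  }
}

// Dispatches a single watch event.
// Returns true when the event was an in-stream error and `done` was called.
function handleWatchEvent(
  event: WatchEvent,
  onObject: (type: string, obj: unknown) => void,
  done: (err: WatchStreamError) => void,
): boolean {
  if (event.type === 'ERROR') {
    // The HTTP status was already sent as 200 when the stream started,
    // so the real error code is only available in the event body.
    const message = event.object.message ?? 'watch error';
    done(new WatchStreamError(message, event.object.code));
    return true;
  }
  onObject(event.type, event.object);
  return false;
}
```

In this sketch the caller would stop reading the stream once `handleWatchEvent` returns true, so that a consumer can react to a `410` by listing again.
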

@k8s-ci-robot added the `cncf-cla: yes` label (indicates the PR's author has signed the CNCF CLA) on Mar 2, 2022
@k8s-ci-robot added the `size/L` label (denotes a PR that changes 100-499 lines, ignoring generated files) on Mar 2, 2022
@brendandburns
Contributor

/lgtm
/approve

Thanks for the PR!

@k8s-ci-robot added the `lgtm` label ("Looks good to me", indicates that a PR is ready to be merged) on Mar 3, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bbatha, brendandburns

@k8s-ci-robot added the `approved` label (indicates a PR has been approved by an approver from all required OWNERS files) on Mar 3, 2022
@k8s-ci-robot merged commit ea5041d into kubernetes-client:master on Mar 3, 2022
@bbatha deleted the fix-informer-watch-optimization-restart branch on March 3, 2022 20:19
@Kaggggggga

Hello, when will this fix be published in a release?

Some background on my case:
After running for a long time, my operator (SDK v0.16.3, on AWS EKS 1.21) gets stuck issuing watch requests repeatedly, without long polling.
According to the API server access log the responses are 200, but when I test the same request path I get HTTP 200 with a 410 error object in the body, and the list API is never called.
I even implemented a workaround that stops and restarts the watch if no event arrives for 15 minutes, but because the old resourceVersion is still present, the restart skips the list call that would refresh it.

My timeline, as I understand it. Without this fix, the watch will:
run for a long time =>
until the initial resourceVersion is too old, at which point the long poll pushes an error object that is not handled =>
after 15 minutes with no new events, the watch is restarted without a list call =>
status 200 with a 410 error object, request stream closed, watch recreated without a list =>
status 200 with a 410 error object, request stream closed, watch recreated without a list => ...
This repeats every time a 410 watch request finishes, causing frequent requests to the API server.

You can reproduce my case by setting an expired initial resourceVersion on the ListWatch.
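
The loop in the timeline above can be sketched as a small simulation. Everything here is hypothetical and for illustration only: `FakeServer`, `simulate`, and the numeric comparison of versions are assumptions made for this example (real resourceVersions are opaque strings, and the real client logic is more involved). The sketch only shows why the stale resourceVersion has to be dropped on a `410` so that the next round performs a list.

```typescript
type Step = 'list' | 'watch';

// Hypothetical stand-in for the API server's retained version window.
interface FakeServer {
  oldestVersion: number;  // versions below this are considered expired (410)
  currentVersion: number; // version a fresh list would return
}

// Simulates `rounds` iterations of a list-then-watch restart loop and
// returns the sequence of calls that would be made.
function simulate(
  startVersion: string,
  server: FakeServer,
  rounds: number,
  resetOn410: boolean,
): Step[] {
  let resourceVersion = startVersion;
  const calls: Step[] = [];
  for (let i = 0; i < rounds; i++) {
    if (resourceVersion === '') {
      // No known version: list first to obtain a fresh one.
      calls.push('list');
      resourceVersion = String(server.currentVersion);
      continue;
    }
    calls.push('watch');
    const expired = Number(resourceVersion) < server.oldestVersion;
    if (expired && resetOn410) {
      // In-stream 410 was handled: drop the stale version so the next round lists.
      resourceVersion = '';
    }
    // Without the reset, the stale version is reused and the watch fails again.
  }
  return calls;
}
```

With an expired starting version and no reset, every round is another failing watch; with the reset, one list call refreshes the version and later watches proceed from it.
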
