[WIP] Fix race conditions and deadlocks in Reflector #125107

karlkfi · 2024-05-23T20:06:50Z

What type of PR is this?

/kind cleanup
/kind flake

What & Why?

Ensure watcher.Stop() is called after the consumer is done consuming
events from the watcher. This is a signal to the producer that it
should stop trying to send events on the result channel. Without it,
a producer that sends on an unbuffered channel can get stuck waiting
for an event to be read by a consumer who is no longer reading.
Add extra checks to exit early when the stop channel has been closed,
to avoid unnecessary API calls.
Refactor watchHandler to be easier to read. Use input and output
bools, instead of a bool pointer.
Fix some of the unit tests to use the stop channel to stop ListAndWatch,
as a real user would, instead of relying on the quirk that
FakeWatcher.Stop() happens to close the result channel.
Add additional unit tests to confirm that the watch handler works
correctly in various permutations where the result channel or stop
channel is closed, before and after watching has started.
Add a MockWatcher for more explicit testing, including order of
operations and result channel closure.
Remove a duplicate assertion in TestReflectorWatchHandler
Use assert instead of require in a few cases where debugging a failed
test is easier if you can see all the failed assertions at once.
Fix some lint errors regarding error comparison (use errors.Is & As)

These changes make the informer/reflector safer to use with third-party watchers, which are often faked for testing controllers.

Does this PR introduce a user-facing change?

NONE

karlkfi · 2024-05-23T20:54:46Z

/cc @p0lyn0mial

Looks like @p0lyn0mial did the last big refactor here and might have the most context.

karlkfi · 2024-05-24T20:09:29Z

It looks like this breaks TestWatchList, for multiple reasons. So I'm going to need to rework it a bit.

Apparently, the new WatchList feature expects to re-use the same watcher and NOT always close it, which is different from the old watcher behavior. Instead, it waits for a bookmark to indicate that the listing is done and passes the watcher to Reflector.watch to finish watching on the same connection. So I'll need to restore that behavior.
The FakeWatcher is not actually threadsafe. It has a multiple race conditions:
- FakeWatcher.Reset writes to the result channel, which races with the event producers writing to it. We probably just need to remove Reset. I don't think anything is using it, and it's kind of an anti-pattern from a go channel usage perspective.
- TestWatchList is calling FakeWatcher.Stop from the producer side, which is wrong. The producer should only ever close the result channel, to tell the consumer no more events are coming. The consumer is the only one that should be calling watcher.Stop, to tell the producer that it's no longer reading events from the result channel. I think I can fix this by adding another method to close the result channel, or accept the channel in the constructor.

k8s-ci-robot · 2024-05-28T18:35:29Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: karlkfi
Once this PR has been reviewed and has the lgtm label, please assign liggitt for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

karlkfi · 2024-05-28T18:38:21Z

Alright, I tried to fix the problem with the FakeWatcher closing the result channel immediately on Stop() without waiting for the producer to finish sending events, but that turned into a much larger change. So I walked that back and left some TODOs instead. To avoid needing all downstream users to refactor immediately on upgrade, it might be better in the future to add a new fixed FakeWatcher and deprecate the existing one. But I'll leave that for a separate PR to keep this one more reviewable.

cici37 · 2024-05-28T20:15:43Z

/triage accepted

- Ensure watcher.Stop() is called after the consumer is done consuming events from the watcher. This is a signal to the producer that it should stop trying to send events on the result channel. Without it, a producer that sends on an unbuffered channel can get stuck waiting for an event to be read by a consumer who is no longer reading. - Ensure - Add extra checks to exit early when the stop channel has been closed, to avoid unnecessary API calls. - Refactor watchHandler to be easier to read. Use input and output bools, instead of a bool pointer. - Fix some of the unit tests to use the stop channel to stop ListAndWatch, as a real user would, instead of relying on the quirk that FakeWatcher.Stop() happens to close the result channel. - Add additional unit tests to confirm that the watch handler works correctly in various permutations where the result channel or stop channel is closed, before and after watching has started. - Add a MockWatcher for more explicit testing, including order of operations and result channel closure. - Remove a duplicate assertion in TestReflectorWatchHandler - Use assert instead of require in a few cases where debugging a failed test is easier if you can see all the failed assertions at once. - Fix some lint errors regarding error comparison (use errors.Is & As)

karlkfi · 2024-05-28T22:56:28Z

/cc @jpbetz

@thockin said he doesn't have context here and suggested @jpbetz instead

karlkfi · 2024-05-30T01:25:00Z

Moved the FakeWatcher changes out to their own PR, since it requires updating all the tests that use it: #125204

karlkfi · 2024-06-03T22:49:05Z

Was asked to break this down. So here's the first few bits:

k8s-ci-robot · 2024-06-04T00:57:37Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-triage-robot · 2024-08-02T12:18:48Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen
Mark this PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2024-08-02T12:18:52Z

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen

Mark this PR as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot requested review from caesarxuchao and thockin May 23, 2024 20:07

k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 23, 2024

karlkfi force-pushed the karl-watch-resultchan-fix branch from 23747ce to c53ee1f Compare May 23, 2024 20:43

k8s-ci-robot requested a review from p0lyn0mial May 23, 2024 20:54

karlkfi mentioned this pull request May 23, 2024

[WIP] Move selector matching to apimachinery #124801

Closed

karlkfi force-pushed the karl-watch-resultchan-fix branch from c53ee1f to bb05d5b Compare May 23, 2024 21:19

karlkfi force-pushed the karl-watch-resultchan-fix branch from bb05d5b to 5991f45 Compare May 25, 2024 04:24

karlkfi force-pushed the karl-watch-resultchan-fix branch from 5991f45 to 193fc54 Compare May 25, 2024 05:26

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/kubectl sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 25, 2024

k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label May 25, 2024

karlkfi force-pushed the karl-watch-resultchan-fix branch from 193fc54 to a734401 Compare May 25, 2024 07:11

k8s-ci-robot added area/kube-proxy sig/network Categorizes an issue or PR as relevant to SIG Network. labels May 25, 2024

karlkfi force-pushed the karl-watch-resultchan-fix branch from a734401 to fb39522 Compare May 28, 2024 18:34

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 28, 2024

karlkfi force-pushed the karl-watch-resultchan-fix branch from fb39522 to 135ed55 Compare May 28, 2024 18:42

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 28, 2024

karlkfi force-pushed the karl-watch-resultchan-fix branch from 135ed55 to 8fd4501 Compare May 28, 2024 21:08

k8s-ci-robot requested a review from jpbetz May 28, 2024 22:56

This was referenced May 31, 2024

[WIP] Refactor FakeWatcher to be thread-safe & race-free #125204

Closed

Improve Reflector unit tests #125299

Merged

Fix Reflector to always close the watcher #125301

Closed

karlkfi changed the title ~~Fix race conditions and deadlocks in Reflector~~ [WIP] Fix race conditions and deadlocks in Reflector Jun 3, 2024

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 3, 2024

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 4, 2024

dims added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 3, 2024

k8s-ci-robot closed this Aug 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[WIP] Fix race conditions and deadlocks in Reflector #125107

[WIP] Fix race conditions and deadlocks in Reflector #125107

Uh oh!

karlkfi commented May 23, 2024 •

edited

Loading

Uh oh!

karlkfi commented May 23, 2024

Uh oh!

karlkfi commented May 24, 2024

Uh oh!

k8s-ci-robot commented May 28, 2024

Uh oh!

karlkfi commented May 28, 2024

Uh oh!

cici37 commented May 28, 2024

Uh oh!

karlkfi commented May 28, 2024

Uh oh!

karlkfi commented May 30, 2024

Uh oh!

karlkfi commented Jun 3, 2024 •

edited

Loading

Uh oh!

k8s-ci-robot commented Jun 4, 2024

Uh oh!

k8s-triage-robot commented Aug 2, 2024

Uh oh!

k8s-ci-robot commented Aug 2, 2024

Uh oh!

Uh oh!

[WIP] Fix race conditions and deadlocks in Reflector #125107

[WIP] Fix race conditions and deadlocks in Reflector #125107

Uh oh!

Conversation

karlkfi commented May 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What & Why?

Does this PR introduce a user-facing change?

Uh oh!

karlkfi commented May 23, 2024

Uh oh!

karlkfi commented May 24, 2024

Uh oh!

k8s-ci-robot commented May 28, 2024

Uh oh!

karlkfi commented May 28, 2024

Uh oh!

cici37 commented May 28, 2024

Uh oh!

karlkfi commented May 28, 2024

Uh oh!

karlkfi commented May 30, 2024

Uh oh!

karlkfi commented Jun 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Jun 4, 2024

Uh oh!

k8s-triage-robot commented Aug 2, 2024

Uh oh!

k8s-ci-robot commented Aug 2, 2024

Uh oh!

Uh oh!

karlkfi commented May 23, 2024 •

edited

Loading

karlkfi commented Jun 3, 2024 •

edited

Loading