Skip to content

[WIP] Fix race conditions and deadlocks in Reflector #125107

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

karlkfi
Copy link
Contributor

@karlkfi karlkfi commented May 23, 2024

What type of PR is this?

/kind cleanup
/kind flake

What & Why?

  • Ensure watcher.Stop() is called after the consumer is done consuming
    events from the watcher. This is a signal to the producer that it
    should stop trying to send events on the result channel. Without it,
    a producer that sends on an unbuffered channel can get stuck waiting
    for an event to be read by a consumer who is no longer reading.
  • Add extra checks to exit early when the stop channel has been closed,
    to avoid unnecessary API calls.
  • Refactor watchHandler to be easier to read. Use input and output
    bools, instead of a bool pointer.
  • Fix some of the unit tests to use the stop channel to stop ListAndWatch,
    as a real user would, instead of relying on the quirk that
    FakeWatcher.Stop() happens to close the result channel.
  • Add additional unit tests to confirm that the watch handler works
    correctly in various permutations where the result channel or stop
    channel is closed, before and after watching has started.
  • Add a MockWatcher for more explicit testing, including order of
    operations and result channel closure.
  • Remove a duplicate assertion in TestReflectorWatchHandler
  • Use assert instead of require in a few cases where debugging a failed
    test is easier if you can see all the failed assertions at once.
  • Fix some lint errors regarding error comparison (use errors.Is & As)

These changes make the informer/reflector safer to use with third-party watchers, which are often faked for testing controllers.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 23, 2024
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 23, 2024
@karlkfi karlkfi force-pushed the karl-watch-resultchan-fix branch from 23747ce to c53ee1f Compare May 23, 2024 20:43
@karlkfi
Copy link
Contributor Author

karlkfi commented May 23, 2024

/cc @p0lyn0mial

Looks like @p0lyn0mial did the last big refactor here and might have the most context.

@karlkfi
Copy link
Contributor Author

karlkfi commented May 24, 2024

It looks like this breaks TestWatchList, for multiple reasons. So I'm going to need to rework it a bit.

  1. Apparently, the new WatchList feature expects to re-use the same watcher and NOT always close it, which is different from the old watcher behavior. Instead, it waits for a bookmark to indicate that the listing is done and passes the watcher to Reflector.watch to finish watching on the same connection. So I'll need to restore that behavior.
  2. The FakeWatcher is not actually threadsafe. It has a multiple race conditions:
    • FakeWatcher.Reset writes to the result channel, which races with the event producers writing to it. We probably just need to remove Reset. I don't think anything is using it, and it's kind of an anti-pattern from a go channel usage perspective.
    • TestWatchList is calling FakeWatcher.Stop from the producer side, which is wrong. The producer should only ever close the result channel, to tell the consumer no more events are coming. The consumer is the only one that should be calling watcher.Stop, to tell the producer that it's no longer reading events from the result channel. I think I can fix this by adding another method to close the result channel, or accept the channel in the constructor.

@karlkfi karlkfi force-pushed the karl-watch-resultchan-fix branch from bb05d5b to 5991f45 Compare May 25, 2024 04:24
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. area/apiserver area/kubelet sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/node Categorizes an issue or PR as relevant to SIG Node. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 25, 2024
@karlkfi karlkfi force-pushed the karl-watch-resultchan-fix branch from 5991f45 to 193fc54 Compare May 25, 2024 05:26
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/kubectl sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 25, 2024
@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label May 25, 2024
@karlkfi karlkfi force-pushed the karl-watch-resultchan-fix branch from 193fc54 to a734401 Compare May 25, 2024 07:11
@k8s-ci-robot k8s-ci-robot added area/kube-proxy sig/network Categorizes an issue or PR as relevant to SIG Network. labels May 25, 2024
@karlkfi karlkfi force-pushed the karl-watch-resultchan-fix branch from a734401 to fb39522 Compare May 28, 2024 18:34
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 28, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: karlkfi
Once this PR has been reviewed and has the lgtm label, please assign liggitt for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karlkfi
Copy link
Contributor Author

karlkfi commented May 28, 2024

Alright, I tried to fix the problem with the FakeWatcher closing the result channel immediately on Stop() without waiting for the producer to finish sending events, but that turned into a much larger change. So I walked that back and left some TODOs instead. To avoid needing all downstream users to refactor immediately on upgrade, it might be better in the future to add a new fixed FakeWatcher and deprecate the existing one. But I'll leave that for a separate PR to keep this one more reviewable.

@karlkfi karlkfi force-pushed the karl-watch-resultchan-fix branch from fb39522 to 135ed55 Compare May 28, 2024 18:42
@cici37
Copy link
Contributor

cici37 commented May 28, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 28, 2024
- Ensure watcher.Stop() is called after the consumer is done consuming
  events from the watcher. This is a signal to the producer that it
  should stop trying to send events on the result channel. Without it,
  a producer that sends on an unbuffered channel can get stuck waiting
  for an event to be read by a consumer who is no longer reading.
- Ensure
- Add extra checks to exit early when the stop channel has been closed,
  to avoid unnecessary API calls.
- Refactor watchHandler to be easier to read. Use input and output
  bools, instead of a bool pointer.
- Fix some of the unit tests to use the stop channel to stop ListAndWatch,
  as a real user would, instead of relying on the quirk that
  FakeWatcher.Stop() happens to close the result channel.
- Add additional unit tests to confirm that the watch handler works
  correctly in various permutations where the result channel or stop
  channel is closed, before and after watching has started.
- Add a MockWatcher for more explicit testing, including order of
  operations and result channel closure.
- Remove a duplicate assertion in TestReflectorWatchHandler
- Use assert instead of require in a few cases where debugging a failed
  test is easier if you can see all the failed assertions at once.
- Fix some lint errors regarding error comparison (use errors.Is & As)
@karlkfi karlkfi force-pushed the karl-watch-resultchan-fix branch from 135ed55 to 8fd4501 Compare May 28, 2024 21:08
@karlkfi
Copy link
Contributor Author

karlkfi commented May 28, 2024

/cc @jpbetz

@thockin said he doesn't have context here and suggested @jpbetz instead

@k8s-ci-robot k8s-ci-robot requested a review from jpbetz May 28, 2024 22:56
@karlkfi
Copy link
Contributor Author

karlkfi commented May 30, 2024

Moved the FakeWatcher changes out to their own PR, since it requires updating all the tests that use it: #125204

@karlkfi karlkfi changed the title Fix race conditions and deadlocks in Reflector [WIP] Fix race conditions and deadlocks in Reflector Jun 3, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 3, 2024
@karlkfi
Copy link
Contributor Author

karlkfi commented Jun 3, 2024

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 4, 2024
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@dims dims added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 3, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/apiserver area/kube-proxy area/kubectl area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/flake Categorizes issue or PR as related to a flaky test. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
Archived in project
Archived in project
Archived in project
Development

Successfully merging this pull request may close these issues.

5 participants