Operator infinite loop on transient errors from kube API #378

@porridge

This change to unit tests shows the issue:

• [FAILED] [0.011 seconds]
Updater when an update is a change [It] should apply an update status function
/home/mowsiany/go/src/github.com/operator-framework/helm-operator-plugins/pkg/reconciler/internal/updater/updater_test.go:102

  [FAILED] HaveLen matcher expects a string/array/map/channel/slice.  Got:
      <nil>: nil
  In [It] at: /home/mowsiany/go/src/github.com/operator-framework/helm-operator-plugins/pkg/reconciler/internal/updater/updater_test.go:108 @ 08/20/24 07:39:09.843

When the API server returns a transient error, the updater never retries the update, provided the update function correctly returns false when it has no effect.

This is because the Apply function reuses the same in-memory object on each update attempt. The first attempt already mutated that object, so on subsequent attempts the update functions report that nothing changed, and no further write is attempted even though the first one failed.
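To make the mechanism concrete, here is a minimal, self-contained model of the behavior described above. It is not the actual updater code: `object`, `flakyClient`, `applyReusingObject` and `applyRefetching` are hypothetical names invented for this sketch, and the retry loop is a simplified stand-in for whatever retry logic Apply really uses. The second function shows one possible direction for a fix (re-reading the object before each attempt), offered only as an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// object is a hypothetical stand-in for the custom resource being reconciled.
type object struct {
	finalizers []string
}

// updateFunc mutates obj and reports whether it changed anything.
type updateFunc func(obj *object) bool

// removeFinalizer returns an updateFunc that correctly returns false
// when the finalizer is already absent.
func removeFinalizer(name string) updateFunc {
	return func(obj *object) bool {
		for i, f := range obj.finalizers {
			if f == name {
				obj.finalizers = append(obj.finalizers[:i], obj.finalizers[i+1:]...)
				return true
			}
		}
		return false
	}
}

// flakyClient is a hypothetical fake API server that fails the first
// `failures` writes, simulating a transient error such as a 500.
type flakyClient struct {
	failures int
	calls    int
	stored   object
}

func (c *flakyClient) Update(obj *object) error {
	c.calls++
	if c.failures > 0 {
		c.failures--
		return errors.New("transient 500 from API server")
	}
	c.stored = object{finalizers: append([]string(nil), obj.finalizers...)}
	return nil
}

// Get returns a fresh copy of the server-side state.
func (c *flakyClient) Get() *object {
	return &object{finalizers: append([]string(nil), c.stored.finalizers...)}
}

// applyReusingObject models the reported problem: the same mutated object
// is reused, so the second attempt sees "no change" and gives up silently.
func applyReusingObject(c *flakyClient, obj *object, fns []updateFunc, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		changed := false
		for _, f := range fns {
			changed = f(obj) || changed
		}
		if !changed {
			return nil // the earlier failed write is forgotten here
		}
		if lastErr = c.Update(obj); lastErr == nil {
			return nil
		}
	}
	return lastErr
}

// applyRefetching is one possible fix (an assumption, not the project's
// chosen solution): start every attempt from a fresh copy of the object.
func applyRefetching(c *flakyClient, fns []updateFunc, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		obj := c.Get()
		changed := false
		for _, f := range fns {
			changed = f(obj) || changed
		}
		if !changed {
			return nil // server state really needs no update
		}
		if lastErr = c.Update(obj); lastErr == nil {
			return nil
		}
	}
	return lastErr
}

func main() {
	fns := []updateFunc{removeFinalizer("uninstall")}

	buggy := &flakyClient{failures: 1, stored: object{finalizers: []string{"uninstall"}}}
	err := applyReusingObject(buggy, buggy.Get(), fns, 3)
	fmt.Println("reuse:   err =", err, "writes =", buggy.calls, "finalizers left =", len(buggy.stored.finalizers))

	fixed := &flakyClient{failures: 1, stored: object{finalizers: []string{"uninstall"}}}
	err = applyRefetching(fixed, fns, 3)
	fmt.Println("refetch: err =", err, "writes =", fixed.calls, "finalizers left =", len(fixed.stored.finalizers))
}
```

In this model the reusing variant returns no error after a single failed write and leaves the finalizer on the stored object, while the refetching variant performs a second write that succeeds. An alternative that avoids the extra read would be to remember that a previous attempt failed and force the write regardless of what the update functions report.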

What is worse, and could be considered a separate issue, is that in the deletion case the code then waits forever for the deletion to happen. It never does, because the update functions applied by doUninstall correctly return false on subsequent attempts, so the failed update is never retried.

In my case the transient error was a 500 from the API server, caused in turn by a misbehaving validation webhook.
