Optimize RS Controller Performance: Reduce Work Duration Time & Minimize Cache Locking #130961

hakuna-matatah · 2025-03-21T01:35:12Z

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

This PR introduces a DaemonSet indexer for the PodInformer so that Pods belonging to DaemonSets, or orphaned Pods, can be queried efficiently from the informer cache without a full namespace scan.

This improves performance and correctness at scale by minimizing the time a read lock is held on the cache, which reduces blocking of the CacheController's delta processing. It also lowers workqueue processing time per object/key.
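To illustrate why an index shortens the time the read lock is held, here is a minimal, standard-library-only sketch. It is not the code from this PR and does not use client-go: the `Pod` and `Store` types, the `OwnerKey`/`OrphanKey` helpers, and the key layout are all hypothetical stand-ins chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Pod is a pared-down, hypothetical stand-in for a cached pod object.
type Pod struct {
	Namespace string
	Name      string
	OwnerUID  string // UID of the controlling owner; empty means orphan
}

// OwnerKey and OrphanKey build index keys. The key layout is an assumption
// made for this sketch, not the layout used by the PR.
func OwnerKey(uid string) string        { return "owner/" + uid }
func OrphanKey(namespace string) string { return "orphan/" + namespace }

func indexKeyFor(p Pod) string {
	if p.OwnerUID == "" {
		return OrphanKey(p.Namespace)
	}
	return OwnerKey(p.OwnerUID)
}

// Store is a toy thread-safe cache with one secondary index.
type Store struct {
	mu    sync.RWMutex
	pods  map[string]Pod                 // "namespace/name" -> pod
	index map[string]map[string]struct{} // index key -> set of pod keys
}

func NewStore() *Store {
	return &Store{
		pods:  make(map[string]Pod),
		index: make(map[string]map[string]struct{}),
	}
}

// Add inserts or updates a pod and keeps the index in sync.
func (s *Store) Add(p Pod) {
	s.mu.Lock()
	defer s.mu.Unlock()
	key := p.Namespace + "/" + p.Name
	if old, ok := s.pods[key]; ok {
		// Drop the stale index entry before re-indexing the updated pod.
		oldKey := indexKeyFor(old)
		delete(s.index[oldKey], key)
		if len(s.index[oldKey]) == 0 {
			delete(s.index, oldKey)
		}
	}
	s.pods[key] = p
	ik := indexKeyFor(p)
	if s.index[ik] == nil {
		s.index[ik] = make(map[string]struct{})
	}
	s.index[ik][key] = struct{}{}
}

// ListNamespace holds the read lock while visiting every cached pod.
func (s *Store) ListNamespace(namespace string) []Pod {
	s.mu.RLock()
	defer s.mu.RUnlock()
	var out []Pod
	for _, p := range s.pods {
		if p.Namespace == namespace {
			out = append(out, p)
		}
	}
	return out
}

// ByIndex holds the read lock only while visiting pods under one index key.
func (s *Store) ByIndex(indexKey string) []Pod {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make([]Pod, 0, len(s.index[indexKey]))
	for key := range s.index[indexKey] {
		out = append(out, s.pods[key])
	}
	return out
}

func main() {
	s := NewStore()
	for i := 0; i < 10000; i++ {
		owner := "rs-other"
		if i < 100 {
			owner = "rs-1"
		}
		s.Add(Pod{Namespace: "default", Name: fmt.Sprintf("pod-%d", i), OwnerUID: owner})
	}
	// The index lookup visits 100 entries; the namespace list visits all 10000.
	fmt.Println(len(s.ByIndex(OwnerKey("rs-1"))), len(s.ListNamespace("default")))
}
```

In this sketch the cost of `ByIndex` depends only on how many pods match the key, while `ListNamespace` grows with the total number of cached pods, so writers waiting on the lock are blocked for less time as the cache grows.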

I benchmarked an index lookup against listing all pods from the store/cache:

With 10k pods in the store, the index lookup took ~30 µs versus ~320 µs for a full list.

With 500k pods in the store, the index lookup took ~38 µs versus ~18 ms for a full list.

At 500k pods, the index lookup is roughly 480x faster than listing from the store (about 10x at 10k pods).

The results are as follows:

=== RUN   TestGetPodsForDaemonSetUsesIndexer
    daemon_controller_test.go:922: ByIndex() fetched 100 pods in 30.252µs
    daemon_controller_test.go:930: List() fetched 10000 pods in 320.047µs

=== RUN   TestGetPodsForDaemonSetUsesIndexer
    daemon_controller_test.go:922: ByIndex() fetched 100 pods in 37.842µs
    daemon_controller_test.go:930: List() fetched 500000 pods in 18.080762ms
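The ratios between the two approaches can be recomputed directly from the durations logged above. The following is a small sketch for checking that arithmetic; the `Speedup` helper is hypothetical and exists only for this example.

```go
package main

import (
	"fmt"
	"time"
)

// Speedup reports how many times longer the full List() took than the
// ByIndex() lookup, given both durations as strings such as "30.252µs".
func Speedup(indexed, listed string) (float64, error) {
	i, err := time.ParseDuration(indexed)
	if err != nil {
		return 0, err
	}
	l, err := time.ParseDuration(listed)
	if err != nil {
		return 0, err
	}
	if i <= 0 {
		return 0, fmt.Errorf("non-positive duration %q", indexed)
	}
	return float64(l) / float64(i), nil
}

func main() {
	// Durations copied from the test output above.
	runs := []struct{ pods, indexed, listed string }{
		{"10k", "30.252µs", "320.047µs"},
		{"500k", "37.842µs", "18.080762ms"},
	}
	for _, run := range runs {
		r, err := Speedup(run.indexed, run.listed)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s pods: List() took %.1fx as long as ByIndex()\n", run.pods, r)
	}
}
```

Note that the two measurements are not like-for-like in volume: ByIndex() returns 100 pods in both runs, while List() returns every pod in the store, which is the point of the comparison.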

Optimizing this list call will provide the following benefits:

Reduced read-lock duration on the informer cache, allowing writes to proceed faster and reducing the chance of cache staleness at scale.
Lower workqueue processing time, enabling faster convergence to the desired state.
Decreased queue wait time, ensuring items are dispatched and processed more efficiently.

Similar improvements were made for the DaemonSet and StatefulSet controllers.

Which issue(s) this PR fixes:

Fixes #130767

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

hakuna-matatah · 2025-03-21T02:51:35Z

/assign @soltysh

hakuna-matatah · 2025-03-21T02:51:50Z

cc: @dims

hakuna-matatah · 2025-03-21T15:47:53Z

/assign @wojtek-t

dims · 2025-04-30T21:20:26Z

@hakuna-matatah the linter CI job is an easy fix. Do you want to do that, and then we can ping folks for review?

atiratree · 2025-05-02T14:45:23Z

/triage accepted
/priority important-soon

soltysh · 2025-05-07T13:49:40Z

/assign @soltysh

Sorry, this slipped through the cracks. @hakuna-matatah can you rebase the PR, fix that linter error (adding //nolint:errcheck is OK for the one the linter is failing on), and then ping me on Slack for a re-review on this one?

hakuna-matatah · 2025-05-12T20:50:43Z

@soltysh Sorry for the delayed response, I have been on vacation. Please check now if you get a chance.

hakuna-matatah · 2025-05-12T23:15:48Z

/retest

soltysh

/lgtm
/approve

k8s-ci-robot · 2025-05-13T14:12:58Z

LGTM label has been added.

Git tree hash: a9606cc0a2acc8b2239c5fd9255724556c746b4e

k8s-ci-robot · 2025-05-13T14:13:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hakuna-matatah, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/controller/replicaset/OWNERS~~ [soltysh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-project-automation bot added this to SIG Apps Mar 21, 2025

github-project-automation bot moved this to Needs Triage in SIG Apps Mar 21, 2025

k8s-ci-robot requested review from atiratree and mimowo March 21, 2025 01:35

k8s-ci-robot assigned soltysh Mar 21, 2025

hakuna-matatah mentioned this pull request Mar 21, 2025

Optimize Statefulset Controller Performance: Reduce Work Duration Time & Minimize Cache Locking. #130806

Merged

k8s-ci-robot assigned wojtek-t Mar 21, 2025

k8s-ci-robot added the release-note-none label and removed the do-not-merge/release-note-label-needed label Mar 24, 2025

k8s-ci-robot added the needs-rebase label Apr 2, 2025

atiratree mentioned this pull request May 2, 2025

deployment: skip processing unchanged deployments during resync #131272

Closed

hashim21223445 approved these changes May 4, 2025

hakuna-matatah force-pushed the rs branch from accf37e to a1ef0ec Compare May 12, 2025 18:43

k8s-ci-robot removed the needs-rebase label May 12, 2025

Optimize RS Controller Performance: Reduce Work Duration Time & Minimize Cache Locking (e42aba6)

hakuna-matatah force-pushed the rs branch from a1ef0ec to e42aba6 Compare May 12, 2025 19:56

soltysh approved these changes May 13, 2025

k8s-ci-robot added the lgtm label May 13, 2025

k8s-ci-robot added the approved label May 13, 2025

k8s-ci-robot merged commit 1325262 into kubernetes:master May 13, 2025
14 checks passed

k8s-ci-robot added this to the v1.34 milestone May 13, 2025

github-project-automation bot moved this from Needs Triage to Done in SIG Apps May 13, 2025

soltysh mentioned this pull request May 20, 2025

add ReplicaSetFailedPodsBackoff: limit pod creation when kubelet fails ReplicaSet pods #130411

Open

xigang mentioned this pull request Jun 17, 2025

Job controller optimization: reduce work duration time & minimize cache locking #132305

Merged

hakuna-matatah mentioned this pull request Jun 19, 2025

Index pods on namespace labels for KCM controllers to query efficiently from Informer cache. Optimize Endpoint and EndpointSlice controller lock contention #132396

Open
