Optimize Statefulset Controller Performance: Reduce Work Duration Time & Minimize Cache Locking. #130806
Conversation
/retest
@hakuna-matatah please add some context for the graphs in the PR body as well
Done @dims. The graph above shows that the 5-minute P99 average of workqueue_work_duration_seconds ranges from ~1s to 8s, representing the time taken to process each item in the StatefulSet queue. Additionally, the 5-minute P99 average wait time for an item in the queue before processing is ~1s. From the pprof analysis, most cycles are spent on listing pods. Optimizing this list call will provide the following benefits:
+1
/assign @soltysh @atiratree
/release-note-none
The Kubernetes project has merge-blocking tests that are currently too flaky to consistently pass. This bot retests PRs for certain kubernetes repos according to the following rules:
You can:
/retest
@hakuna-matatah: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/lgtm cancel [Just to avoid constant retesting of linter hints]
@hakuna-matatah you have an approve/lgtm from @wojtek-t (#130806 (comment)). However, there is a linter CI job that is failing; can you please fix it? Then I can re-apply lgtm.
/lgtm
LGTM label has been added. Git tree hash: 963844811a9687d111f4bba3de5e133d598d99bd
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This PR introduces a StatefulSet indexer for the PodInformer to efficiently query Pods belonging to StatefulSets or Orphans from the InformerCache, avoiding a full namespace scan.
This improves performance and correctness at scale by minimizing the time a read lock is held on the cache, reducing blockage of CacheController delta processing. It also helps lower workqueue processing time per object/key.
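The core idea is to file each pod under its owning StatefulSet at cache-write time, so a sync only touches that set's bucket instead of scanning the whole namespace under a read lock. A minimal, self-contained sketch of that idea follows (plain Go maps standing in for client-go's cache.Indexer; all names here are illustrative, not the PR's actual identifiers):

```go
package main

import "fmt"

// Pod is a simplified stand-in for a v1.Pod with an owner reference.
type Pod struct {
	Name  string
	Owner string // owning StatefulSet name; "" if orphan
}

// podIndex maps an index key (owner name, or "orphan") to pod names,
// mirroring how an informer indexer maps keys to object sets.
type podIndex map[string][]string

// indexKey computes the bucket a pod is filed under.
func indexKey(p Pod) string {
	if p.Owner == "" {
		return "orphan"
	}
	return p.Owner
}

// buildIndex files every pod under its key once, at write time.
func buildIndex(pods []Pod) podIndex {
	idx := podIndex{}
	for _, p := range pods {
		k := indexKey(p)
		idx[k] = append(idx[k], p.Name)
	}
	return idx
}

func main() {
	pods := []Pod{
		{Name: "web-0", Owner: "web"},
		{Name: "web-1", Owner: "web"},
		{Name: "db-0", Owner: "db"},
		{Name: "stray", Owner: ""},
	}
	idx := buildIndex(pods)
	// A lookup touches only the matching bucket instead of walking
	// every pod in the namespace while holding the cache read lock.
	fmt.Println(idx["web"])    // [web-0 web-1]
	fmt.Println(idx["orphan"]) // [stray]
}
```

In the real controller the equivalent would be registering an index function on the pod informer and querying it with ByIndex; the sketch above only shows why the lookup cost drops from namespace size to per-set size.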
From the pprof analysis, most cycles are spent in the getPodsForStatefulset phase, particularly in the ListAllByNamespace call. Another major contributor is the ClaimPods phase, which involves processing all of those pods. Reducing the number of pods processed per cycle not only alleviates lock contention on the cache (allowing the CacheController to write more efficiently) but also decreases work_queue_processing_time within the sync loop, improving the performance and throughput of the StatefulSet controller.
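The shape of the win can be sketched with an illustrative comparison (hypothetical helper names, not the controller's actual code): the old path lists every pod in the namespace and tests each against the selector, so cost grows with namespace size; the indexed path jumps straight to the precomputed bucket:

```go
package main

import "fmt"

// Pod is a simplified stand-in carrying only what selection needs.
type Pod struct {
	Name   string
	Labels map[string]string
}

// listByScan models the old path: walk every pod in the namespace and
// test the selector. Work is proportional to total namespace size.
func listByScan(all []Pod, selector map[string]string) []Pod {
	var out []Pod
	for _, p := range all {
		match := true
		for k, v := range selector {
			if p.Labels[k] != v {
				match = false
				break
			}
		}
		if match {
			out = append(out, p)
		}
	}
	return out
}

// listByIndex models the new path: return the bucket built at write
// time. Work is proportional to the StatefulSet's own pod count.
func listByIndex(index map[string][]Pod, key string) []Pod {
	return index[key]
}

func main() {
	all := []Pod{
		{Name: "web-0", Labels: map[string]string{"app": "web"}},
		{Name: "db-0", Labels: map[string]string{"app": "db"}},
	}
	index := map[string][]Pod{"web": {all[0]}, "db": {all[1]}}
	fmt.Println(len(listByScan(all, map[string]string{"app": "web"}))) // 1
	fmt.Println(len(listByIndex(index, "web")))                        // 1
}
```

Both paths return the same pods; the difference is how much of the shared informer cache is traversed, and therefore how long its read lock is held.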
Which issue(s) this PR fixes:
Fixes #130767
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: