### What happened?

When the cluster is under load (i.e. there is a fair amount of requests across all priority levels), we notice that critical requests are not served on time (i.e. they are answered with 429s). For us, critical requests commonly fall in the `leader-election` and `node-high` priority levels, and failures to serve those requests result in controller restarts and unhealthy nodes. We believe the total volume of those critical requests is well within the capacity of the machine hosting kube-apiserver; in fact, it is requests from the other priority levels that consume the majority of the machine's capacity.
### What did you expect to happen?

With default settings, kube-apiserver should be able to prioritize requests from the `leader-election` and `node-high` priority levels. More specifically, I hope there is a way to define a priority order over all priority levels and enforce that higher-priority requests are always served first (while each priority level keeps a fair minimal share); only after that, if the machine has remaining capacity, should lower priorities be served.
This has a few pros:
- Users don't need to precisely calculate and configure the nominal shares for each priority level, taking machine capacity, workload type, etc. into account.
- Concurrency shares are well utilized, since unused shares are passed down to lower priority levels.
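To make the proposal concrete, here is a minimal sketch of the desired dispatch rule, not APF's actual algorithm: levels are ordered by priority, each level keeps a guaranteed minimum number of seats, and a lower level may only use capacity left after the unused guarantees of higher levels are reserved. `Level` and `can_dispatch` are hypothetical names for illustration.

```python
from dataclasses import dataclass

@dataclass
class Level:
    name: str
    guaranteed: int   # seats this level can always use
    in_flight: int = 0

def can_dispatch(levels, total_seats, level_name):
    """Decide whether a request at `level_name` may start now.

    `levels` is ordered highest priority first. A request is admitted
    if its level is still within its guaranteed minimum, or if there
    is spare capacity after reserving the unused guaranteed seats of
    all higher-priority levels.
    """
    used = sum(l.in_flight for l in levels)
    if used >= total_seats:
        return False
    target = next(l for l in levels if l.name == level_name)
    if target.in_flight < target.guaranteed:
        return True  # guaranteed minimum is always available
    # Reserve the unused guaranteed seats of higher-priority levels.
    reserved = 0
    for l in levels:
        if l.name == level_name:
            break
        reserved += max(0, l.guaranteed - l.in_flight)
    return used + reserved < total_seats
```

Under this rule, `workload-low` traffic can never crowd out `leader-election` below its guarantee, yet idle high-priority seats beyond the guarantees remain usable by lower levels.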
### How can we reproduce it (as minimally and precisely as possible)?

- Run a cluster and send a high volume of requests across all priority levels.
- Run `kubectl get --raw /debug/api_priority_and_fairness/dump_priority_levels` to observe the effective concurrency shares.
### Anything else we need to know?

I believe it's currently the case that APF doesn't implement prioritization; instead, it reserves a portion of capacity for each priority level, and that reservation can be overcommitted. In practice, however, some requests are always more important than others. Although this is configurable, I hope the default settings take it into consideration.
According to the APF KEP and code, the default configuration for non-exempt priority levels gives system/workload-high/workload-low the bulk of the total concurrency shares (170 of 245 nominal shares, roughly 70%).
Name | Nominal Shares | Lendable | Proposed Borrowing Limit | Guaranteed |
---|---|---|---|---|
leader-election | 10 | 0% | none | 10 |
node-high | 40 | 25% | none | 30 |
system | 30 | 33% | none | 20 |
workload-high | 40 | 50% | none | 20 |
workload-low | 100 | 90% | none | 10 |
global-default | 20 | 50% | none | 10 |
catch-all | 5 | 0% | none | 5 |
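Per the APF design, each level's nominal concurrency limit is its proportional slice of the server's total concurrency limit. A quick computation over the defaults above shows how skewed the split is; the value 600 is a hypothetical total server concurrency limit chosen only for illustration.

```python
import math

# Default nominal concurrency shares, from the table above.
shares = {
    "leader-election": 10,
    "node-high": 40,
    "system": 30,
    "workload-high": 40,
    "workload-low": 100,
    "global-default": 20,
    "catch-all": 5,
}

total = sum(shares.values())  # 245

# A level's nominal concurrency limit is its proportional slice of the
# server's total concurrency limit. 600 is a hypothetical total (the sum
# of --max-requests-inflight and --max-mutating-requests-inflight).
server_cl = 600
limits = {name: math.ceil(server_cl * ncs / total) for name, ncs in shares.items()}

print(limits["workload-low"])      # 245 seats
print(limits["leader-election"])   # 25 seats
```

With these defaults, `workload-low` alone holds about 41% of all nominal shares, roughly ten times the seats of `leader-election`, which is what makes the starvation under load plausible.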
If we want to apply some degree of prioritization, we might think of the priority order as `leader-election` > `node-high` > `system` > ... . However, the default settings effectively prioritize `workload-low`, since it gets the largest portion of the total shares.
### Kubernetes version

```console
$ kubectl version
# paste output here
```

### Cloud provider

### OS version

```console
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
```