
[APF] low priorities have larger effective shares than high priorities #121982


Description

@linxiulei

What happened?

When a cluster is under load (i.e. there is a fair amount of traffic across all priority levels), we notice that critical requests are not served on time (i.e. they get 429 responses). For us, critical requests commonly fall into the leader-election and node-high priority levels, and failure to serve them results in controller restarts and unhealthy nodes. We believe the total volume of these critical requests is well within the capacity of the machine that hosts kube-apiserver; in fact, it is requests from other priority levels that consume the majority of the machine's capacity.

What did you expect to happen?

With default settings, kube-apiserver is able to prioritize processing requests from the leader-election and node-high priority levels. More specifically, I hope there is a way to define a priority order across all priority levels and enforce a mechanism in which higher-priority requests are always served first (while each priority level still gets a fair minimal share); only after that, if the machine has remaining capacity, are lower priorities served. A rough sketch of such an allocation follows the list below.

This has a few pros:

  1. Users don't need to precisely calculate and configure the nominal shares for each priority level while accounting for different machine capacities, workload types, etc.
  2. Concurrency shares are well utilized, since unused shares are passed down and shared with lower priority levels.
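
To make the request concrete, here is a minimal sketch of the kind of allocation described above. This is not how APF works today; the type, function, and demand numbers are all hypothetical, and the guaranteed minimums are simply the "Guaranteed" values from the default configuration table further down.

```go
package main

import "fmt"

// priorityLevel is a hypothetical description of one priority level: its
// name, a guaranteed minimum number of concurrency seats, and the number
// of seats it currently wants (its demand).
type priorityLevel struct {
	name       string
	guaranteed int
	demand     int
}

// allocate hands out serverCL seats in strict priority order (levels must
// be ordered from highest to lowest priority). Every level first receives
// its guaranteed minimum (capped by its demand); the remaining capacity is
// then handed to levels from the top down until it runs out.
func allocate(levels []priorityLevel, serverCL int) map[string]int {
	out := make(map[string]int, len(levels))
	remaining := serverCL

	// Pass 1: satisfy the guaranteed minimum of every level.
	for _, l := range levels {
		g := min(min(l.guaranteed, l.demand), remaining)
		out[l.name] = g
		remaining -= g
	}

	// Pass 2: give leftover capacity to higher priorities first.
	for _, l := range levels {
		if remaining == 0 {
			break
		}
		extra := min(l.demand-out[l.name], remaining)
		out[l.name] += extra
		remaining -= extra
	}
	return out
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	levels := []priorityLevel{ // highest priority first; demands are made up
		{name: "leader-election", guaranteed: 10, demand: 15},
		{name: "node-high", guaranteed: 30, demand: 80},
		{name: "system", guaranteed: 20, demand: 50},
		{name: "workload-high", guaranteed: 20, demand: 200},
		{name: "workload-low", guaranteed: 10, demand: 500},
	}
	fmt.Println(allocate(levels, 600))
}
```

With this scheme, workload-low can still use whatever capacity is left over, but it can never crowd out leader-election or node-high beyond their guaranteed minimums.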

How can we reproduce it (as minimally and precisely as possible)?

  1. Run a cluster and send a high volume of requests across all priority levels (PLs).
  2. Run kubectl get --raw /debug/api_priority_and_fairness/dump_priority_levels to observe the effective concurrency shares.
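
For step 2, the same dump can also be fetched programmatically. A minimal client-go sketch, assuming a kubeconfig at the default location and a user permitted to read that non-resource URL:

```go
package main

import (
	"context"
	"fmt"
	"path/filepath"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Load the local kubeconfig (adjust the path for your environment).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Equivalent of:
	//   kubectl get --raw /debug/api_priority_and_fairness/dump_priority_levels
	raw, err := clientset.CoreV1().RESTClient().
		Get().
		AbsPath("/debug/api_priority_and_fairness/dump_priority_levels").
		DoRaw(context.TODO())
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```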

Anything else we need to know?

I believe it is currently the case that APF does not implement prioritization; instead, it reserves a portion of capacity for each priority level, and that reservation can be overcommitted. In practice, however, some requests are always more important than others. Although this is configurable, I hope the default settings can take it into consideration.

According to the APF KEP and code, the default values for the non-exempt priority levels give system, workload-high, and workload-low roughly 70% of the total concurrency shares (170 of 245), with workload-low alone holding the largest single portion:

| Name | Nominal Shares | Lendable | Proposed Borrowing Limit | Guaranteed |
|---|---|---|---|---|
| leader-election | 10 | 0% | none | 10 |
| node-high | 40 | 25% | none | 30 |
| system | 30 | 33% | none | 20 |
| workload-high | 40 | 50% | none | 20 |
| workload-low | 100 | 90% | none | 10 |
| global-default | 20 | 50% | none | 10 |
| catch-all | 5 | 0% | none | 5 |
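
For context, the nominal shares above only become concrete limits once the server-wide concurrency limit is applied. The sketch below assumes the proportional split described in the KEP (each level gets ServerCL * shares / total shares, rounded up) and a server concurrency limit of 600 (the default --max-requests-inflight=400 plus --max-mutating-requests-inflight=200); the exact rounding and borrowing adjustments in the real implementation may differ.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Nominal shares from the table above.
	shares := []struct {
		name string
		ncs  float64
	}{
		{"leader-election", 10},
		{"node-high", 40},
		{"system", 30},
		{"workload-high", 40},
		{"workload-low", 100},
		{"global-default", 20},
		{"catch-all", 5},
	}

	// Assumed server-wide concurrency limit: default --max-requests-inflight
	// (400) plus --max-mutating-requests-inflight (200).
	const serverCL = 600.0

	var total float64
	for _, s := range shares {
		total += s.ncs
	}

	// Proportional split: each level's nominal concurrency limit is its
	// fraction of the total shares applied to the server-wide limit.
	for _, s := range shares {
		cl := math.Ceil(serverCL * s.ncs / total)
		fmt.Printf("%-16s share=%5.1f%%  nominalCL≈%3.0f\n", s.name, 100*s.ncs/total, cl)
	}
}
```

With these numbers, workload-low ends up with roughly 41% of the shares (about 245 of 600 seats), while leader-election gets about 4% (about 25 seats).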

If we want to apply some degree of prioritization, we might think of the priority order as leader-election > node-high > system > .... However, the default settings effectively prioritize workload-low, since it gets the largest share of the total.

Kubernetes version

$ kubectl version
# paste output here

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

Metadata


Labels

kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
