
Tags: pohly/kubernetes

log-client-go-tools-cache-apis-same-interface

client-go/tools/cache: avoid API breaks

Extending the SharedInformer interface and changing the signatures of the
NewShared* functions are API breaks. For example, controller-runtime mocks
SharedInformer and stores the factory function in a field.

We can avoid breaking such clients by adding new interfaces with the new
methods and new alternative factory functions. The downside of keeping the new
methods out of the SharedInformer interface is that callers of client-go only
get access to them through a type assertion, as sketched below.
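
A minimal sketch of the pattern, with illustrative names rather than the
actual client-go identifiers: a new interface embeds the old one, a new
factory function returns it, and existing consumers keep compiling.

    package cache

    import "context"

    // SharedInformer stands in for the existing, frozen interface.
    type SharedInformer interface {
        Run(stopCh <-chan struct{})
    }

    // SharedInformerWithContext adds the new method without touching
    // SharedInformer, so mocks and wrappers of the old interface keep
    // compiling.
    type SharedInformerWithContext interface {
        SharedInformer
        RunWithContext(ctx context.Context)
    }

    // NewSharedInformerWithContext is the alternative factory function.
    func NewSharedInformerWithContext() SharedInformerWithContext {
        return &sharedInformer{}
    }

    type sharedInformer struct{}

    func (s *sharedInformer) Run(stopCh <-chan struct{})          {}
    func (s *sharedInformer) RunWithContext(ctx context.Context)  {}

    // RunViaTypeAssertion shows how a caller that only holds a
    // SharedInformer reaches the new method.
    func RunViaTypeAssertion(ctx context.Context, informer SharedInformer) {
        if si, ok := informer.(SharedInformerWithContext); ok {
            si.RunWithContext(ctx)
        }
    }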

dra-structured-performance-2024-09-09

DRA scheduler: implement CEL attribute proxies

For the simple scenario without any attribute checks in CEL, the main
advantage is that it reduces memory allocations. Once a CEL expression
actually accesses attributes, pre-computing the map lookup might be faster,
too.
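
As a rough illustration of the proxy idea (the real scheduler code differs):
instead of materializing a fresh map of all attributes for every CEL
evaluation, the expression gets a proxy that resolves an attribute only when
it is actually read.

    package main

    import "fmt"

    type Device struct {
        Attributes map[string]string
    }

    // attributeProxy defers lookups until an expression actually reads
    // an attribute; evaluations that never touch attributes therefore
    // cause no per-attribute allocations.
    type attributeProxy struct {
        device *Device
    }

    func (p attributeProxy) Get(name string) (string, bool) {
        v, ok := p.device.Attributes[name]
        return v, ok
    }

    func main() {
        d := &Device{Attributes: map[string]string{"driver.example.com/model": "a100"}}
        p := attributeProxy{device: d}
        if v, ok := p.Get("driver.example.com/model"); ok {
            fmt.Println(v)
        }
    }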

                                    │   before   │         after  vs base
     SchedulingThroughput/Average     88.42 ± 7%   93.07 ± 14%       ~ (p=0.310 n=6)
     SchedulingThroughput/Perc50      16.01 ± 6%   17.01 ± 18%       ~ (p=0.069 n=6)
     SchedulingThroughput/Perc90      380.1 ± 3%   377.0 ±  5%       ~ (p=0.485 n=6)
     SchedulingThroughput/Perc95      389.8 ± 2%   379.5 ±  5%       ~ (p=0.084 n=6)
     SchedulingThroughput/Perc99      405.5 ± 5%   387.5 ±  3%  -4.44% (p=0.041 n=6)
     runtime_seconds                  63.36 ± 5%   60.78 ±  7%       ~ (p=0.093 n=6)

gotestsum-old

test: filter "go test" output with gotestsum instead of grep

Filtering the output with grep leads to hard-to-read log output, e.g. from
pull-kubernetes-unit:

    +++ [0613 15:32:48] Running tests without code coverage and with -race
    {"Time":"2024-06-13T15:33:47.845457374Z","Action":"output","Package":"k8s.io/kubernetes/cluster/gce/cos","Test":"TestCreateMasterAuditPolicy","Output":"        /tmp/configure-helper-test47992121/kube-env: line 1: `}'\n"}
    {"Time":"2024-06-13T15:33:49.053732803Z","Action":"output","Package":"k8s.io/kubernetes/cluster/gce/cos","Output":"ok  \tk8s.io/kubernetes/cluster/gce/cos\t2.906s\n"}

We can do better than that. By feeding the output of the "go test" command(s)
into gotestsum *while it runs*, we can use --format=standard-quiet (= normal
"go test" output) by default or --format=standard-verbose (= "go test -v")
when FULL_LOG is requested to get nicer output.

This works when testing everything at once. That was said to be impossible
when doing coverage profiling, but recent Go releases no longer have that
limitation, so the xargs trick gets removed. All that coverage profiling still
needs is some additional parameters and the conversion to HTML, as sketched
below.
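
A sketch of what the resulting invocations can look like (the actual wiring
inside the Kubernetes test scripts differs):

    # default: normal "go test" output
    gotestsum --format=standard-quiet -- -race ./...
    # FULL_LOG: the equivalent of "go test -v"
    gotestsum --format=standard-verbose -- -race ./...
    # coverage profiling only adds parameters plus the HTML conversion
    gotestsum -- -race -coverprofile=coverage.out ./...
    go tool cover -html=coverage.out -o coverage.html

Arguments after "--" are passed through to "go test" by gotestsum.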

log-client-go-tools-cache-same-interface

client-go, apimachinery: require using new APIs with context

All code in k/k has been updated to use the new API variants with contextual
logging support, so now this can be required for all code.
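
What the migration looks like for callers, as a sketch (the informer method
name below is illustrative; the authoritative list of context-aware variants
is in the client-go API):

    package main

    import (
        "context"

        "k8s.io/klog/v2"
    )

    func run(ctx context.Context) {
        // Contextual logging: the logger travels in the context
        // instead of being a global.
        logger := klog.FromContext(ctx)
        logger.Info("starting informer")

        // Old variant, stop channel only: informer.Run(ctx.Done())
        // New variant (name illustrative): informer.RunWithContext(ctx)
    }

    func main() {
        run(klog.NewContext(context.Background(), klog.Background()))
    }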

dra-1.31-2024-07-21-II

DRA kubelet: refactor gRPC call timeouts

Some of the E2E node tests were flaky. Their timeout was apparently chosen
under the assumption that kubelet would retry immediately after a failed gRPC
call, with a factor of 2 as safety margin. But according to kubernetes@0449cef,
kubelet has a different, higher retry period of 90 seconds, which was exactly
the test timeout. The test timeout has to be higher than that.

As the tests don't use the gRPC call timeout anymore, it can be made
private. While at it, the name and documentation get updated.
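
Roughly the shape of the change, with illustrative identifiers rather than the
actual kubelet code: the per-call timeout becomes a private constant that
wraps every gRPC call, and tests pick their own, longer timeouts.

    package dra

    import (
        "context"
        "time"
    )

    // defaultClientCallTimeout bounds a single gRPC call to the DRA
    // driver. It is private on purpose: tests must not derive their
    // own timeouts from it.
    const defaultClientCallTimeout = 45 * time.Second

    type plugin struct{}

    // withCallTimeout applies the timeout around one gRPC call.
    func (p *plugin) withCallTimeout(ctx context.Context, call func(ctx context.Context) error) error {
        ctx, cancel := context.WithTimeout(ctx, defaultClientCallTimeout)
        defer cancel()
        return call(ctx)
    }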

dra-1.31-2024-07-21-I

DRA kubelet: refactor gRPC call timeouts

Some of the E2E node tests were flaky. Their timeout was apparently chosen
under the assumption that kubelet would retry immediately after a failed gRPC
call, with a factor of 2 as safety margin. But according to kubernetes@0449cef,
kubelet has a different, higher retry period of 90 seconds, which was exactly
the test timeout. The test timeout has to be higher than that.

As the tests don't use the gRPC call timeout anymore, it can be made
private. While at it, the name and documentation get updated.

dra-1.31-2024-07-17-I

DRA: add DRAControlPlaneController feature gate for "classic DRA"

In the API, the effect of the feature gate is that alpha fields get dropped on
create. They get preserved during updates if already set. The
PodSchedulingContext registration is *not* restricted by the feature gate.
This enables deleting stale PodSchedulingContext objects after disabling
the feature gate.

The scheduler checks the new feature gate before setting up an informer for
PodSchedulingContext objects and when deciding whether it can schedule a
pod. If any claim depends on a control plane controller, the scheduler bails
out, leading to:

    Status:       Pending
    ...
      Warning  FailedScheduling             73s   default-scheduler  0/1 nodes are available: resourceclaim depends on disabled DRAControlPlaneController feature. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

The rest of the changes prepare for testing the new feature separately from
"structured parameters". The goal is to have base "dra" jobs which just enable
and test those, then "classic-dra" jobs which add DRAControlPlaneController.
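
The create/update behavior follows the usual Kubernetes API strategy pattern;
a simplified sketch with illustrative types, not the actual resource.k8s.io
code:

    package strategy

    // Spec stands in for the claim spec; Controller is the alpha field
    // guarded by DRAControlPlaneController.
    type Spec struct {
        Controller string
    }

    type Claim struct {
        Spec Spec
    }

    // dropDisabledFields clears the alpha field when the gate is off,
    // except during updates of objects that already use it.
    func dropDisabledFields(newClaim, oldClaim *Claim, gateEnabled bool) {
        if gateEnabled {
            return
        }
        if oldClaim != nil && oldClaim.Spec.Controller != "" {
            return // preserve on update if already set
        }
        newClaim.Spec.Controller = ""
    }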

dra-1.31-2024-07-16-III

DRA: add DRAControlPlaneController feature gate for "classic DRA"

In the API, the effect of the feature gate is that alpha fields get dropped on
create. They get preserved during updates if already set. The
PodSchedulingContext registration is *not* restricted by the feature gate.
This enables deleting stale PodSchedulingContext objects after disabling
the feature gate.

The scheduler checks the new feature gate before setting up an informer for
PodSchedulingContext objects and when deciding whether it can schedule a
pod. If any claim depends on a control plane controller, the scheduler bails
out, leading to:

    Status:       Pending
    ...
      Warning  FailedScheduling             73s   default-scheduler  0/1 nodes are available: resourceclaim depends on disabled DRAControlPlaneController feature. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

The rest of the changes prepare for testing the new feature separately from
"structured parameters". The goal is to have base "dra" jobs which just enable
and test those, then "classic-dra" jobs which add DRAControlPlaneController.

dra-1.31-2024-07-16-II

DRA: add DRAControlPlaneController feature gate for "classic DRA"

In the API, the effect of the feature gate is that alpha fields get dropped on
create. They get preserved during updates if already set. The
PodSchedulingContext registration is *not* restricted by the feature gate.
This enables deleting stale PodSchedulingContext objects after disabling
the feature gate.

The scheduler checks the new feature gate before setting up an informer for
PodSchedulingContext objects and when deciding whether it can schedule a
pod. If any claim depends on a control plane controller, the scheduler bails
out, leading to:

    Status:       Pending
    ...
      Warning  FailedScheduling             73s   default-scheduler  0/1 nodes are available: resourceclaim depends on disabled DRAControlPlaneController feature. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

The rest of the changes prepare for testing the new feature separately from
"structured parameters". The goal is to have base "dra" jobs which just enable
and test those, then "classic-dra" jobs which add DRAControlPlaneController.

dra-1.31-2024-07-16-I

DRA scheduler: adapt to v1alpha3 API

The structured parameter allocation logic was written from scratch in
staging/src/k8s.io/dynamic-resource-allocation/structured, where it might be
useful for out-of-tree components.

Besides the new features (amount, admin access) and the new API, it now
supports backtracking when the initial device selection doesn't lead to a
complete allocation of all claims.
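
Conceptually the backtracking is a depth-first search over device choices; a
toy sketch that shares nothing with the real allocator's data model:

    package structured

    // allocate assigns one free device to each claim, undoing a choice
    // when it leaves a later claim unsatisfiable.
    func allocate(claims []string, free map[string]bool, chosen map[string]string) bool {
        if len(claims) == 0 {
            return true // every claim satisfied
        }
        claim := claims[0]
        for device, ok := range free {
            if !ok {
                continue
            }
            free[device] = false
            chosen[claim] = device
            if allocate(claims[1:], free, chosen) {
                return true
            }
            // Backtrack: try the next device for this claim.
            free[device] = true
            delete(chosen, claim)
        }
        return false
    }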

Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
Co-authored-by: John Belamaric <jbelamaric@google.com>