Tags: pohly/kubernetes
client-go/tools/cache: avoids API breaks

Extending the SharedInformer interface and changing the signatures of the NewShared* factory functions would be API breaks. For example, controller-runtime mocks SharedInformer and stores the factory function in a field. We can avoid breaking such clients by adding new interfaces with the new methods and new alternative factory functions. The downside of not having the new methods in the SharedInformer interface is that callers of client-go do not get access to them unless they cast.
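A minimal sketch of that pattern, with made-up names (SharedInformerWithExtras, NewSharedInformerWithExtras and NewMethod are placeholders, not the actual additions): the new methods live on a separate interface that embeds the old one, so existing mocks and stored factory functions keep compiling, and callers cast when they need the extras.

// Sketch only: offering new methods without touching SharedInformer itself.
package informerdemo

import "time"

// SharedInformer stands in for the existing interface; it stays unchanged.
type SharedInformer interface {
	Run(stopCh <-chan struct{})
	HasSynced() bool
}

// SharedInformerWithExtras is a hypothetical new interface. Because it embeds
// the old one, every implementation still satisfies SharedInformer.
type SharedInformerWithExtras interface {
	SharedInformer
	// NewMethod is a placeholder for whatever method actually gets added.
	NewMethod() error
}

// NewSharedInformerWithExtras is a hypothetical alternative factory function;
// the existing NewShared* functions keep their old signatures.
func NewSharedInformerWithExtras(resync time.Duration) SharedInformerWithExtras {
	return &fakeInformer{resync: resync}
}

type fakeInformer struct{ resync time.Duration }

func (f *fakeInformer) Run(stopCh <-chan struct{}) {}
func (f *fakeInformer) HasSynced() bool            { return true }
func (f *fakeInformer) NewMethod() error           { return nil }

// Callers that only hold a SharedInformer have to cast to reach the new method:
//
//	if e, ok := informer.(SharedInformerWithExtras); ok {
//		_ = e.NewMethod()
//	}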
DRA scheduler: implement CEL attribute proxies

The main advantage for the simple scenario without any attribute checks in CEL is that it reduces memory allocations. Once a CEL expression really accesses attributes, pre-computing the map lookup might be faster, too.

                              │   before    │    after     │  vs base
SchedulingThroughput/Average  │ 88.42 ±  7% │ 93.07 ± 14%  │ ~      (p=0.310 n=6)
SchedulingThroughput/Perc50   │ 16.01 ±  6% │ 17.01 ± 18%  │ ~      (p=0.069 n=6)
SchedulingThroughput/Perc90   │ 380.1 ±  3% │ 377.0 ±  5%  │ ~      (p=0.485 n=6)
SchedulingThroughput/Perc95   │ 389.8 ±  2% │ 379.5 ±  5%  │ ~      (p=0.084 n=6)
SchedulingThroughput/Perc99   │ 405.5 ±  5% │ 387.5 ±  3%  │ -4.44% (p=0.041 n=6)
runtime_seconds               │ 63.36 ±  5% │ 60.78 ±  7%  │ ~      (p=0.093 n=6)
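The underlying idea, sketched with simplified stand-in types (Device, Attribute and the convert helper are assumptions, not the real DRA types): instead of eagerly building a map with one converted CEL value per attribute for every device, a proxy converts an attribute only when the expression actually asks for it, so expressions that never touch attributes cause no per-attribute allocations at all.

// Sketch only: eager conversion vs. a lazy attribute proxy.
package attributeproxy

// Attribute and Device are simplified stand-ins for the real DRA API types.
type Attribute struct {
	Name  string
	Value string
}

type Device struct {
	Attributes []Attribute
}

// convert stands in for the real attribute-to-CEL-value conversion.
func convert(v string) any { return v }

// eagerAttributes allocates a map plus one converted value per attribute for
// every device, even if the CEL expression never looks at any of them.
func eagerAttributes(d *Device) map[string]any {
	m := make(map[string]any, len(d.Attributes))
	for _, a := range d.Attributes {
		m[a.Name] = convert(a.Value)
	}
	return m
}

// attributeProxy defers all work until Get is called. The trade-off noted
// above: once attributes really are accessed, a pre-computed map wins on
// lookup speed.
type attributeProxy struct {
	device *Device
}

func (p attributeProxy) Get(name string) (any, bool) {
	for _, a := range p.device.Attributes {
		if a.Name == name {
			return convert(a.Value), true
		}
	}
	return nil, false
}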
test: filter "go test" output with gotestsum instead of grep Filtering the output with grep leads to hard to read log output, e.g. from pull-kubernetes-unit: +++ [0613 15:32:48] Running tests without code coverage and with -race {"Time":"2024-06-13T15:33:47.845457374Z","Action":"output","Package":"k8s.io/kubernetes/cluster/gce/cos","Test":"TestCreateMasterAuditPolicy","Output":" /tmp/configure-helper-test47992121/kube-env: line 1: `}'\n"} {"Time":"2024-06-13T15:33:49.053732803Z","Action":"output","Package":"k8s.io/kubernetes/cluster/gce/cos","Output":"ok \tk8s.io/kubernetes/cluster/gce/cos\t2.906s\n"} We can do better than that. When feeding the output of the "go test" command(s) into gotestsum *while it runs*, we can use --format=standard-quiet (= normal go test output) or --format=standard-verbose (= `go test -v`) when FULL_LOG is requested to get nicer output. This works when testing everything at once. This was said to be not possible when doing coverage profiling. But recent Go no longer has that limitation, so the xargs trick gets removed. All that we need to do for coverage profiling is to add some additional parameters and the conversion to HTML.
client-go, apimachinery: require using new APIs with context

All code in k/k has been updated to use the new API variants with contextual logging support, so now this can be required for all code.
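What "API variants with contextual logging support" means in practice, sketched with hypothetical function names (doSomething/doSomethingWithContext are illustrations; klog.FromContext is the real helper): the new variants accept a context and take their logger from it instead of relying on the global klog state.

// Sketch only: an old-style helper next to its contextual replacement.
package contextualdemo

import (
	"context"

	"k8s.io/klog/v2"
)

// doSomething is the old style: no context, global logger.
func doSomething(name string) {
	klog.InfoS("processing", "name", name)
}

// doSomethingWithContext is the style that can now be required: the caller's
// context carries the logger (and cancellation).
func doSomethingWithContext(ctx context.Context, name string) {
	logger := klog.FromContext(ctx)
	logger.Info("processing", "name", name)
}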
DRA kubelet: refactor gRPC call timeouts

Some of the E2E node tests were flaky. Their timeout apparently was chosen under the assumption that kubelet would retry immediately after a failed gRPC call, with a factor of 2 as safety margin. But according to kubernetes@0449cef, kubelet has a different, higher retry period of 90 seconds, which was exactly the test timeout. The test timeout has to be higher than that.

As the tests don't use the gRPC call timeout anymore, it can be made private. While at it, the name and documentation get updated.
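The timing relationship behind the fix, as a sketch with assumed constant names and values (pluginClientTimeout, retryPeriod and testTimeout are illustrations, not the real kubelet code): a test timeout derived from the per-call timeout alone is too short once kubelet only retries every 90 seconds.

// Sketch only: why the test timeout must exceed the retry period.
package dratimeouts

import (
	"context"
	"time"
)

// pluginClientTimeout stands in for the now-private per-call gRPC timeout.
const pluginClientTimeout = 45 * time.Second

// retryPeriod reflects the observation from kubernetes@0449cef: kubelet
// retries a failed call only after 90 seconds, not immediately.
const retryPeriod = 90 * time.Second

// testTimeout has to cover a full retry cycle plus one more call attempt;
// 2 * pluginClientTimeout would expire before kubelet even retries.
const testTimeout = retryPeriod + pluginClientTimeout

// callPlugin bounds a single gRPC call with the per-call timeout.
func callPlugin(ctx context.Context, call func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, pluginClientTimeout)
	defer cancel()
	return call(ctx)
}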
DRA: add DRAControlPlaneController feature gate for "classic DRA"

In the API, the effect of the feature gate is that alpha fields get dropped on create. They get preserved during updates if already set. The PodSchedulingContext registration is *not* restricted by the feature gate. This enables deleting stale PodSchedulingContext objects after disabling the feature gate.

The scheduler checks the new feature gate before setting up an informer for PodSchedulingContext objects and when deciding whether it can schedule a pod. If any claim depends on a control plane controller, the scheduler bails out, leading to:

    Status:       Pending
    ...
      Warning  FailedScheduling  73s  default-scheduler  0/1 nodes are available: resourceclaim depends on disabled DRAControlPlaneController feature. no new claims to deallocate, preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.

The rest of the changes prepare for testing the new feature separately from "structured parameters". The goal is to have base "dra" jobs which just enable and test those, then "classic-dra" jobs which add DRAControlPlaneController.
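A sketch of the usual Kubernetes pattern behind "dropped on create, preserved during updates if already set", using simplified stand-in types (ResourceClaim here only has a ControllerName field; the real registry strategy code differs in detail):

// Sketch only: feature-gated dropping of alpha fields.
package claimstrategy

// ResourceClaimSpec is a stand-in; ControllerName represents the alpha field
// that only makes sense together with a control plane controller.
type ResourceClaimSpec struct {
	ControllerName string
}

type ResourceClaim struct {
	Spec ResourceClaimSpec
}

// dropDisabledFields clears alpha fields on create when the feature gate is
// off, but preserves them on update if the old object already used them, so
// that unrelated updates don't silently wipe data written while the gate was
// still enabled.
func dropDisabledFields(newClaim, oldClaim *ResourceClaim, gateEnabled bool) {
	if gateEnabled {
		return
	}
	if oldClaim != nil && oldClaim.Spec.ControllerName != "" {
		return
	}
	newClaim.Spec.ControllerName = ""
}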
DRA scheduler: adapt to v1alpha3 API

The structured parameter allocation logic was written from scratch in staging/src/k8s.io/dynamic-resource-allocation/structured where it might be useful for out-of-tree components. Besides the new features (amount, admin access) and API, it now supports backtracking when the initial device selection doesn't lead to a complete allocation of all claims.

Co-authored-by: Ed Bartosh <eduard.bartosh@intel.com>
Co-authored-by: John Belamaric <jbelamaric@google.com>
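A much-simplified sketch of what backtracking means here (Request, Device and the suitable check are placeholders, not the real types in k8s.io/dynamic-resource-allocation/structured): pick a candidate device for the current request, recurse for the remaining requests, and undo the choice if they cannot be satisfied with what is left.

// Sketch only: backtracking over device candidates.
package structureddemo

type Device struct {
	ID        string
	allocated bool
}

type Request struct {
	Name string
	// suitable reports whether a device can satisfy this request; in the
	// real code this involves selectors and CEL expressions.
	suitable func(*Device) bool
}

// allocate assigns one device per request, or reports false if no
// combination of the available devices works.
func allocate(requests []Request, devices []*Device, result map[string]*Device) bool {
	if len(requests) == 0 {
		return true
	}
	current := requests[0]
	for _, device := range devices {
		if device.allocated || !current.suitable(device) {
			continue
		}
		device.allocated = true
		result[current.Name] = device
		if allocate(requests[1:], devices, result) {
			return true
		}
		// Backtrack: this choice prevented a complete allocation.
		device.allocated = false
		delete(result, current.Name)
	}
	return false
}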