Skip to content

fix flaky garbage collector tests #131211

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 14, 2025
Merged

Conversation

BenTheElder
Copy link
Member

What type of PR is this?

/kind flake

What this PR does / why we need it:

  • Ensures e2e tests don't assume 100% of allocatable pods are consumable
  • Marks these tests Serial to ensure we don't attempt to run multiple e2e tests spamming many pods at the same time

Which issue(s) this PR fixes:

Fixes #124369

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 8, 2025
@BenTheElder
Copy link
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 8, 2025
@k8s-ci-robot k8s-ci-robot added area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/testing Categorizes an issue or PR as relevant to SIG Testing. approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 8, 2025
@BenTheElder
Copy link
Member Author

/test

@k8s-ci-robot

This comment was marked as resolved.

@BenTheElder
Copy link
Member Author

/test pull-kubernetes-conformance-kind-ga-only

adds [Serial] tags to some tests that schedule lots of pods
@k8s-ci-robot k8s-ci-robot added area/conformance Issues or PRs related to kubernetes conformance tests sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. and removed approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Apr 8, 2025
@BenTheElder
Copy link
Member Author

/test pull-kubernetes-conformance-kind-ga-only

func estimateMaximumPods(ctx context.Context, c clientset.Interface, min, max int32) int32 {
nodes, err := e2enode.GetReadySchedulableNodes(ctx, c)
framework.ExpectNoError(err)

availablePods := int32(0)
// estimate some reasonable overhead per-node for pods that are non-test
const daemonSetReservedPods = 10
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value is semi-arbitrary. I rounded up to the nearest order of magnitude.

I think it's unlikely that a node has zero allocatable pods that aren't required for other purposes like networking daemons, greater than 10 seems excessive but possible.

Really this test is fundamentally fragile. We could instead assume some low N per node, but any value will be somewhat arbitrary. This should at least be better.

The real fix in this PR is tagging the tests [Serial] so they don't cause flakes in other tests while they're running. We only run [Serial] tagged tests when we're not running tests in parallel (TODO: someday instead the suite could handle this internally)

For conformance that is no real change, because conformance submissions are supposed to be done one test at a time (that does mean the tests take hours, but it reduces the chance of load related flakes versus testing the behaviors one by one)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only run [Serial] tagged tests when we're not running tests in parallel (TODO: someday instead the suite could handle this internally)

Ginkgo v2 already handles that automatically. Jobs which run tests in parallel do not need to skip [Serial] anymore, unless they explicitly want to run all serial tests separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we disable serial tests in helper scripts when running in parallel? We should remove that because it's not necessary and makes it impossible to run all tests in a single job.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally do.
We should look to change that in 1.34

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will again have to be careful, tests that need to be run serially are often less stable to begin with.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related https://onsi.github.io/ginkgo/#serial-specs and #111944. but Serial is the decorator not the tag, I think Patrick already moved all tags to decorators for Serial

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that these get [Serial] added to the name (to disambiguate from "tags") and that many jobs will skip them because we have [Serial] in the skip regex when doing parallel testing.

pull-kubernetes-e2e-gce and pull-kubernetes-e2e-kind (and similar jobs) both explicitly skip [Serial] when invoking ginkgo.

@BenTheElder
Copy link
Member Author

/cc @smarterclayton @johnbelamaric

@aojea
Copy link
Member

aojea commented Apr 9, 2025

conformance job timed out but I do not see this test is taking a lot of time

"component":"entrypoint","file":"sigs.k8s.io/prow/pkg/entrypoint/run.go:169","func":"sigs.k8s.io/prow/pkg/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 2h0m0s timeout","severity":"error","time":"2025-04-09T00:57:59Z"}

/retest

@BenTheElder
Copy link
Member Author

BenTheElder commented Apr 9, 2025

conformance job timed out but I do not see this test is taking a lot of time

I think moving these to serial increased the total runtime a bit.

Creating pods slower must be increasing the runtime, pending kubernetes/test-infra#34663

@BenTheElder
Copy link
Member Author

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 10, 2025
@BenTheElder
Copy link
Member Author

We need kubernetes/test-infra#34663 (or longer) to get full results, presumably the tests take longer due to us throttling the pod creation rate.

@BenTheElder
Copy link
Member Author

/test pull-kubernetes-conformance-kind-ga-only

@BenTheElder
Copy link
Member Author

The job completed in 2h4m
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/131211/pull-kubernetes-conformance-kind-ga-only/1910691823370637312

With:

Ginkgo ran 1 suite in 1h47m47.656016505s

On a PR without this change it completed in 1h59m just under the previous 2h timeout:
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/126563/pull-kubernetes-conformance-kind-ga-only/1910692083300044800

With:

Ginkgo ran 1 suite in 1h41m26.773124098s

So the regression in test runtime is a few minutes when run non-parallel.
That's worth reducing flakiness IMHO, though the computation of max pods is still a little dubious.

@aojea
Copy link
Member

aojea commented Apr 12, 2025

/lgtm
/approve

/assign @dims

this is causing a lot of flakiness

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 12, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: e0ed7b4522f15ede7bcfe3dd3838610bc7d90bcd

@dims
Copy link
Member

dims commented Apr 12, 2025

/approve
/lgtm

cc @kubernetes/release-team @npolshakova

@npolshakova can we please add milestone for this one?

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, BenTheElder, dims

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 12, 2025
@aojea
Copy link
Member

aojea commented Apr 14, 2025

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 14, 2025
@npolshakova
Copy link
Contributor

/milestone v1.33

@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Apr 14, 2025
@k8s-ci-robot k8s-ci-robot merged commit d6e3f34 into kubernetes:master Apr 14, 2025
12 of 16 checks passed
@BenTheElder BenTheElder deleted the max-pods branch April 16, 2025 17:55
saisindhuri91 added a commit to linux-on-ibm-z/kubernetes that referenced this pull request Apr 23, 2025
commit 9ba7dce
Author: Kubernetes Release Robot <k8s-release-robot@users.noreply.github.com>
Date:   Tue Apr 22 16:27:30 2025 +0000

    CHANGELOG: Update directory for v1.30.12 release

commit 191c34e
Author: Kubernetes Release Robot <k8s-release-robot@users.noreply.github.com>
Date:   Tue Apr 22 16:15:31 2025 +0000

    CHANGELOG: Update directory for v1.31.8 release

commit 7bf818f
Author: Kubernetes Release Robot <k8s-release-robot@users.noreply.github.com>
Date:   Tue Apr 22 16:21:58 2025 +0000

    CHANGELOG: Update directory for v1.32.4 release

commit 680ea07
Merge: 0d9dccf e467c95
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Mon Apr 21 09:19:06 2025 -0700

    Merge pull request kubernetes#131369 from ameukam/update-1242-master

    [Go] Bump dependencies, images and versions used to Go 1.24.2 and distroless iptables

commit 0d9dccf
Merge: 66931f0 95b926c
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Sat Apr 19 16:20:59 2025 -0700

    Merge pull request kubernetes#131382 from liggitt/watch-list-e2e

    Correctly feature-gate WatchList e2e

commit 95b926c
Author: Jordan Liggitt <liggitt@google.com>
Date:   Sat Apr 19 17:18:31 2025 -0400

    Feature-gate watchlist e2e

commit 66931f0
Merge: b53b9fb 660df22
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Fri Apr 18 07:55:08 2025 -0700

    Merge pull request kubernetes#131359 from deads2k/disable

    Stop exposing list-via-watch from the server

commit e467c95
Author: Arnaud Meukam <ameukam@gmail.com>
Date:   Fri Apr 18 15:49:48 2025 +0200

    [Go] Bump dependencies, images and versions used to Go 1.24.2 and distroless-iptables

    Signed-off-by: Arnaud Meukam <ameukam@gmail.com>

commit 660df22
Author: David Eads <deads@redhat.com>
Date:   Thu Apr 17 16:34:46 2025 -0400

    Stop exposing list-via-watch from the server

    With StreamingCollectionEncodingToJSON and
    StreamingCollectionEncodingToProtobuf, the WatchList must re-justify its
    necessity.  To prevent an ecosystem from building around a feature that
    may not be promoted, we will stop serving list-via-watch until
    performance numbers can justify its inclusion.

    This also stops the kube-controller-manager from using the
    list-via-watch by default.  The fallback is a regular list, so during
    the skew during an upgrade the "right" thing will happen and the new
    StreamingCollectionEncoding will be used.

commit b53b9fb
Merge: 44c230b a8f6d77
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Wed Apr 16 11:05:08 2025 -0700

    Merge pull request kubernetes#131015 from aojea/final_servicecidr

    Add the missing Conformance Test for ServiceCIDR and IPAddress APIS

commit a8f6d77
Author: Antonio Ojea <aojea@google.com>
Date:   Tue Apr 15 14:52:14 2025 +0000

    ServiceCIDR and IPAddess Conformance

    Change-Id: I6ee188cc8c163c312f8a8da9f1277d83e1ea634c

commit 44c230b
Author: Kubernetes Release Robot <k8s-release-robot@users.noreply.github.com>
Date:   Tue Apr 15 17:20:14 2025 +0000

    CHANGELOG: Update directory for v1.33.0-rc.1 release

commit 30469e1
Merge: cfed8c7 0266d3b
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Mon Apr 14 11:17:06 2025 -0700

    Merge pull request kubernetes#131263 from aojea/dualstack_upgrade

    Allow to convert clusters Service CIDRs from single to dual stack

commit 0266d3b
Author: Antonio Ojea <aojea@google.com>
Date:   Fri Apr 11 15:37:28 2025 +0000

    Allow single-to-dual-stack reconfiguration for ServiceCIDR

    This change modifies the validation logic for ServiceCIDR updates
    (`ValidateServiceCIDRUpdate`) to specifically permit upgrading a
    single-stack ServiceCIDR (either IPv4 or IPv6) to a dual-stack
    configuration.

    This reconfiguration path is considered safe because it only involves adding
    a new CIDR range without altering the existing primary CIDR. This
    ensures that existing Service IP allocations are not disrupted.

    Other modifications, such as:
    - Downgrading from dual-stack to single-stack
    - Reordering CIDRs in a dual-stack configuration
    - Changing the primary CIDR during a single-to-dual-stack
      reconfiguration

    remain disallowed by the validation. These operations carry a higher
    risk of breaking existing Services or cluster networking
    configurations. Preventing these updates automatically encourages
    administrators to perform such changes manually after carefully
    assessing the potential impact on their specific cluster environment.
    The validation errors and controller logs provide guidance when such
    disallowed changes are attempted.

    Change-Id: I41dc09dfddb05f277925da2262f8114d6accbd1d

commit cfed8c7
Merge: d6e3f34 7d7fc2d
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Mon Apr 14 08:19:13 2025 -0700

    Merge pull request kubernetes#131234 from carlory/fix-131229

    Fix flaky test: Metrics should grab all metrics from kubelet /metrics/resource endpoint

commit d6e3f34
Merge: b15dfce 18249aa
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Mon Apr 14 08:19:06 2025 -0700

    Merge pull request kubernetes#131211 from BenTheElder/max-pods

    fix flaky garbage collector tests

commit b15dfce
Merge: 908bdb3 e5a5f72
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Fri Apr 11 04:46:42 2025 -0700

    Merge pull request kubernetes#131240 from jsafrane/selinux-tests-tag

    Tag SELinux tests that require SELinux warning controller

commit 7d7fc2d
Author: carlory <baofa.fan@daocloud.io>
Date:   Thu Apr 10 17:58:05 2025 +0800

    Fix flaky test: Metrics should grab all metrics from kubelet /metrics/resource endpoint

    Signed-off-by: carlory <baofa.fan@daocloud.io>

commit 908bdb3
Merge: 88dfcb2 505836c
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Thu Apr 10 17:08:48 2025 -0700

    Merge pull request kubernetes#131250 from jimangel/publishing-release-33

    staging/publishing: add release-1.33 rules

commit 505836c
Author: Jim Angel <jameswangel@gmail.com>
Date:   Thu Apr 10 16:42:44 2025 -0500

    staging/publishing: add release-1.33 rules

commit e5a5f72
Author: Jan Safranek <jsafrane@redhat.com>
Date:   Thu Apr 10 14:28:25 2025 +0200

    Tag SELinux tests that require SELinux warning controller

    The controller is available only with SELinuxChangePolicy FG enabled.
    Kubernetes e2e jobs do the right --focus & --skip explicitly in the SELinux
    jobs, this just helps other test runners to figure out what FGs are needed
    to run the tests.

    When at it, I noticed SELinuxMountReadWriteOncePod tag is always required, so
    add it on the common level and not in each test.

commit 88dfcb2
Merge: cacd595 52298cf
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Wed Apr 9 07:06:48 2025 -0700

    Merge pull request kubernetes#131065 from pohly/dra-kubelet-registration-unit-test-fix

    DRA kubelet: fix potential flake in unit test

commit cacd595
Author: Kubernetes Release Robot <k8s-release-robot@users.noreply.github.com>
Date:   Tue Apr 8 19:22:49 2025 +0000

    CHANGELOG: Update directory for v1.33.0-rc.0 release

commit 18249aa
Author: Benjamin Elder <bentheelder@google.com>
Date:   Tue Apr 8 15:54:45 2025 -0700

    hack/update-conformance-yaml.sh

    adds [Serial] tags to some tests that schedule lots of pods

commit 640489a
Merge: 92af6ab b1a9cc3
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Tue Apr 8 15:12:51 2025 -0700

    Merge pull request kubernetes#131196 from siyuanfoundation/forward-api

    bug fix: fix version order in emulation forward compatibility.

commit 1eab303
Author: Benjamin Elder <bentheelder@google.com>
Date:   Tue Apr 8 14:46:38 2025 -0700

    mark tests that use estimateMaxPods as serial

commit b2933c0
Author: Benjamin Elder <bentheelder@google.com>
Date:   Tue Apr 8 14:46:06 2025 -0700

    estimate some system daemonset overhead for max pods

commit b1a9cc3
Author: Siyuan Zhang <sizhang@google.com>
Date:   Mon Apr 7 18:38:43 2025 -0700

    bug fix: fix version order in emulation forward compatibility.

    Signed-off-by: Siyuan Zhang <sizhang@google.com>

commit 92af6ab
Merge: ab3e83f 2ef4a84
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Tue Apr 8 12:18:44 2025 -0700

    Merge pull request kubernetes#131204 from dims/move-to-released-version-of-prometheus/client_golang-v1.22.0-from-rc.0

    Move to released version of prometheus/client_golang v1.22.0 from rc.0

commit 2ef4a84
Author: Davanum Srinivas <davanum@gmail.com>
Date:   Tue Apr 8 08:35:18 2025 -0400

    Move to released version of prometheus/client_golang v1.22.0 from rc.0

    Signed-off-by: Davanum Srinivas <davanum@gmail.com>

commit ab3e83f
Merge: 195803c 10a7d6f
Author: Kubernetes Prow Robot <20407524+k8s-ci-robot@users.noreply.github.com>
Date:   Tue Apr 8 02:06:44 2025 -0700

    Merge pull request kubernetes#131146 from mauriciopoppe/csi-proxy-1-2-1

    Bump CSI Proxy to v1.2.1-gke.2

commit 10a7d6f
Author: Mauricio Poppe <mauriciopoppe@google.com>
Date:   Tue Apr 1 20:22:44 2025 +0000

    Update CSI Proxy to v1.2.1-gke.2

commit 52298cf
Author: Patrick Ohly <patrick.ohly@intel.com>
Date:   Wed Mar 26 09:21:50 2025 +0100

    DRA kubelet: fix potential flake in unit test

    If the test binary ran long enough after test completion to reach the
    ResourceSlice removal grace period, that background activity started and failed
    because it was using out-dated state and an invalid testing.T pointer, causing
    a panic.

    The root cause was to leave those background activities running. They need to
    be stopped before a test returns.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/conformance Issues or PRs related to kubernetes conformance tests area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Flaky] Garbage collector should not delete dependents that have both valid owner and owner that's waiting for dependents to be deleted
7 participants