Skip to content

Separate backOff policy for static pod #126019

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

llhhbc
Copy link
Contributor

@llhhbc llhhbc commented Jul 11, 2024

Most static pods run as critical components. When an exception occurs and a restart is required, the sooner the better, so a separate backoff policy is set for static pods.

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Copy link
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 11, 2024
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 11, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @llhhbc. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 11, 2024
@k8s-ci-robot k8s-ci-robot requested review from feiskyer and odinuge July 11, 2024 03:57
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 11, 2024
@llhhbc llhhbc force-pushed the add-static-back-off-limit branch from 355f76b to 67ab247 Compare July 11, 2024 04:02
@kannon92
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 12, 2024
@kannon92
Copy link
Contributor

kannon92 commented Jul 12, 2024

I think this is an interesting feature but not sure this implementation is the right path forward.

If we wanted to control BackOffzlimit for static pods I wonder if this should be an API in the pod spec..

I would consider a KEP for this as I don’t think annotations would be a good path forward.

@llhhbc
Copy link
Contributor Author

llhhbc commented Jul 15, 2024

I think this is an interesting feature but not sure this implementation is the right path forward.

If we wanted to control BackOffzlimit for static pods I wonder if this should be an API in the pod spec..

I would consider a KEP for this as I don’t think annotations would be a good path forward.

@kannon92 How is this planned? Is this field reflected in the pod spec? Or in priorityClassName?

@kannon92
Copy link
Contributor

Pod spec but I think you’d want to bring this up to sig node as API changes usually require some buyin

@llhhbc
Copy link
Contributor Author

llhhbc commented Jul 15, 2024

@kannon92 How should I submit?

@kannon92
Copy link
Contributor

I’d suggest taking this PR to sig-node meeting and see what the consensus is.

@llhhbc
Copy link
Contributor Author

llhhbc commented Jul 18, 2024

@kannon92 I don't know how to proceed with this? I should not have relevant permissions, can you initiate a relevant application?

@llhhbc llhhbc force-pushed the add-static-back-off-limit branch 2 times, most recently from db0f65c to fb39754 Compare July 18, 2024 03:02
@bart0sh
Copy link
Contributor

bart0sh commented Jul 29, 2024

@llhhbc please, present this on SIG-Node meeting as @kannon92 suggested, thanks.

@kannon92
Copy link
Contributor

/hold

@k8s-ci-robot
Copy link
Contributor

@llhhbc: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: llhhbc
Once this PR has been reviewed and has the lgtm label, please assign tallclair for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@llhhbc
Copy link
Contributor Author

llhhbc commented Feb 16, 2025

@bart0sh Sorry, I don't know how to present this on [SIG-Node meeting]

@bart0sh
Copy link
Contributor

bart0sh commented Feb 27, 2025

@bart0sh Sorry, I don't know how to present this on [SIG-Node meeting]

You just need to call into the meeting and explain your idea there. Updating agenda would be nice to do as well.

@bart0sh bart0sh moved this from Triage to Waiting on Author in SIG Node: code and documentation PRs Feb 27, 2025
@llhhbc
Copy link
Contributor Author

llhhbc commented Feb 28, 2025

@bart0sh I don't have the corresponding permissions. Can you help initiate the application? Thanks

@bart0sh
Copy link
Contributor

bart0sh commented Feb 28, 2025

@bart0sh I don't have the corresponding permissions. Can you help initiate the application? Thanks

Sure, I can update meeting agenda for the next meeting on your behalf. Can you show me what to put there? Your name and a topic should be enough. Is next meeting(next Tuesday 10 AM PT) ok for you?

@bart0sh
Copy link
Contributor

bart0sh commented Feb 28, 2025

@llhhbc BTW, rebasing your changes would make the PR look better.

Most static pods run as critical components. When an exception occurs and a restart is required, the sooner the better, so a separate backoff policy is set for static pods.
@llhhbc llhhbc force-pushed the add-static-back-off-limit branch from fb39754 to 502715b Compare March 3, 2025 06:40
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 3, 2025
@llhhbc
Copy link
Contributor Author

llhhbc commented Mar 3, 2025

@bart0sh Thanks, I have rebase the commit.
I'm sorry that my English ability may not allow me to attend the meeting and explain directly, but maybe I can go to the meeting and listen in to learn more. I'll tell you the detailed description of the problem: kubelet does not distinguish the restart strategy separately for static, because the pods running statically are generally more important pods, such as apiserver, etcd, etc. The restart interval of such containers will seriously affect the fault recovery time of the entire platform. Because the recovery of all pods depends on the recovery of the apiserver. And I saw that the new version of k8s has added features.KubeletCrashLoopBackOffMax, but this is the startup parameter configuration of kubelet, which is too rough, because not all pods require such frequent restart intervals. Can the configuration be applied to the pod?
I am not free at the meeting time(Tuesday 10 AM PT), Because it was early morning.

@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 14, 2025
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Copy link
Contributor

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@github-project-automation github-project-automation bot moved this from Waiting on Author to Done in SIG Node: code and documentation PRs Apr 13, 2025
@llhhbc
Copy link
Contributor Author

llhhbc commented Apr 14, 2025

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Apr 14, 2025
@k8s-ci-robot
Copy link
Contributor

@llhhbc: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Copy link
Contributor

Please note that we're already in Test Freeze for the release-1.33 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.33.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Mon Apr 14 01:30:02 UTC 2025.

@llhhbc
Copy link
Contributor Author

llhhbc commented Apr 14, 2025

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Apr 14, 2025
@bart0sh bart0sh moved this from Triage to Waiting on Author in SIG Node: code and documentation PRs Jun 10, 2025
@dims dims added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2025
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
Status: Waiting on Author
Development

Successfully merging this pull request may close these issues.

6 participants