
Fix startup probe worker termination for sidecar containers #133072


Open · AadiDev005 wants to merge 1 commit into master from fix-sidecar-startup-probe

Conversation

@AadiDev005 commented Jul 19, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

Fixes a bug where startup probe workers terminate incorrectly for sidecar containers with restartPolicy=Always when the pod has restartPolicy=Never. This causes main containers to remain stuck in "Initializing" state even after sidecar containers successfully restart.

Problem Description:

  • Pod with restartPolicy: Never
  • Sidecar container with restartPolicy: Always and startup probe
  • Sidecar fails initially, then restarts successfully
  • Bug: Main container never starts (stuck in Initializing)
  • Root cause: Probe worker terminates based only on pod restart policy

Solution:
A minimal fix that adds a container-level restart policy check, applied only to init containers, to the probe worker's continuation logic. This allows workers to keep probing sidecar containers while preserving the existing behavior for regular containers.
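For reference, a pod spec that reproduces this scenario looks roughly like the sketch below, written against the Go client types. Container names, images, and probe settings are illustrative and are not taken from this PR.

package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// reproPod sketches a pod whose sidecar (restartable init container) has a
// startup probe that fails until the sidecar restarts. With pod-level
// restartPolicy: Never, the probe worker previously stopped after the
// sidecar's first termination, leaving the main container stuck in
// "Initializing" even after the sidecar restarted successfully.
func reproPod() *corev1.Pod {
	always := corev1.ContainerRestartPolicyAlways
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "sidecar-startup-probe-repro"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			InitContainers: []corev1.Container{{
				Name:          "sidecar",
				Image:         "registry.k8s.io/e2e-test-images/agnhost:2.45",
				RestartPolicy: &always, // container-level policy that makes this a sidecar
				StartupProbe: &corev1.Probe{
					ProbeHandler: corev1.ProbeHandler{
						Exec: &corev1.ExecAction{Command: []string{"cat", "/tmp/started"}},
					},
					PeriodSeconds:    1,
					FailureThreshold: 60,
				},
			}},
			Containers: []corev1.Container{{
				Name:  "main",
				Image: "registry.k8s.io/pause:3.9",
			}},
		},
	}
}

With the fix, the startup probe worker keeps running because the sidecar's container-level restartPolicy is Always, so the probe can succeed after the restart and the main container is allowed to start.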

Which issue(s) this PR is related to:

Fixes #132826

Special notes for your reviewer:

  • Minimal change: Single line addition with proper null safety checks and init container validation
  • No breaking changes: Existing behavior preserved for regular containers
  • Scoped to init containers only: Fix only applies to restartable init containers (sidecars)
  • Comprehensive testing: Unit tests cover both sidecar and regular container scenarios
  • E2E test included: Validates end-to-end functionality

Does this PR introduce a user-facing change?

Fixed a startup probe race condition that caused main containers to remain stuck in "Initializing" state when sidecar containers with startup probes failed initially but succeeded on restart in pods with restartPolicy=Never.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the following labels on Jul 19, 2025: do-not-merge/invalid-commit-message, kind/bug, size/L, do-not-merge/release-note-label-needed, cncf-cla: yes, do-not-merge/needs-sig, needs-triage, needs-ok-to-test, needs-priority
@k8s-ci-robot (Contributor)

Hi @AadiDev005. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the area/kubelet, area/test, sig/node, and sig/testing labels and removed the do-not-merge/needs-sig label on Jul 19, 2025
@gjkim42 (Member) commented Jul 19, 2025

/cc
/priority important-soon
/ok-to-test

@k8s-ci-robot k8s-ci-robot requested a review from gjkim42 July 19, 2025 06:00
@k8s-ci-robot added the ok-to-test and priority/important-soon labels and removed the needs-ok-to-test and needs-priority labels on Jul 19, 2025
@AadiDev005 AadiDev005 force-pushed the fix-sidecar-startup-probe branch from 2e2983c to 29f796b Compare July 19, 2025 06:26
@k8s-ci-robot removed the do-not-merge/invalid-commit-message label on Jul 19, 2025
@gjkim42 (Member) left a comment

It looks mostly good at first glance 👍

Left some comments; let's see if the e2e tests pass.

@@ -2126,6 +2126,133 @@ var _ = SIGDescribe(feature.SidecarContainers, "Containers Lifecycle", func() {
addAfterEachForCleaningUpPods(f)
f.NamespacePodSecurityLevel = admissionapi.LevelPrivileged

ginkgo.When("A sidecar container with startup probe fails initially", func() {
@gjkim42 (Member) commented Jul 19, 2025

nit:
Can we move this test to the bottom of the "SidecarContainers Container Lifecycle" block, since it covers a very narrow edge case?

ginkgo.When("A sidecar container with startup probe fails initially", func() {
ginkgo.It("should continue probing and allow main container to start after sidecar recovers [NodeConformance]", func(ctx context.Context) {
(Member)

Can we remove the [NodeConformance] label here?
I think we don't need it for this edge case, and it is deprecated as well.

@@ -253,7 +253,9 @@ func (w *worker) doProbe(ctx context.Context) (keepGoing bool) {
		}
		// Abort if the container will not be restarted.
		return c.State.Terminated == nil ||
-			w.pod.Spec.RestartPolicy != v1.RestartPolicyNever
+			w.pod.Spec.RestartPolicy != v1.RestartPolicyNever ||
+			(w.container.RestartPolicy != nil && *w.container.RestartPolicy == v1.ContainerRestartPolicyAlways)
(Member)

Can we check this only if the container is an init container?
The usage of the container restart policy for a regular container is not defined yet.
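For readers following along: this would mean the worker only honors the container-level restart policy when the probed container is declared among the pod's init containers. The PR's later revision calls a w.isInitContainer() helper whose body is not shown in this thread, so the following is only a hedged sketch of one plausible implementation; the actual helper's name and details may differ.

// isInitContainer reports whether the worker's container is declared as an
// init container on the worker's pod (sketch; not the PR's actual code).
func (w *worker) isInitContainer() bool {
	for _, c := range w.pod.Spec.InitContainers {
		if c.Name == w.container.Name {
			return true
		}
	}
	return false
}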

@AadiDev005 AadiDev005 force-pushed the fix-sidecar-startup-probe branch from fc077e0 to bc4856a Compare August 3, 2025 11:24
@k8s-ci-robot removed the do-not-merge/invalid-commit-message label on Aug 3, 2025
@AadiDev005 AadiDev005 requested a review from gjkim42 August 3, 2025 11:28
@gjkim42 (Member) commented Aug 3, 2025

/test pull-kubernetes-node-kubelet-serial-containerd-sidecar-containers
/test pull-kubernetes-node-e2e-containerd-features
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-features
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e-features

@gjkim42 (Member) commented Aug 3, 2025

Thank you.
/lgtm

@k8s-ci-robot added the lgtm label on Aug 3, 2025
@k8s-ci-robot (Contributor)

LGTM label has been added.

Git tree hash: f45acaaa0c792d94f849ef3ab006a26ee7d9b01b

@gjkim42 (Member) left a comment

/lgtm
/assign @SergeyKanzhelev
Could you take a look at this?

This fixes the bug where a startup probe of a restartable init container may not be executed if the restartable init container restarts.

@gjkim42 (Member) commented Aug 3, 2025

/retest

@gjkim42 (Member) commented Aug 4, 2025

/retest

@AadiDev005 (Author)

Hi @gjkim42 / @SergeyKanzhelev,

The required pull-kubernetes-unit job is failing in the scheduler’s dynamic-resources tests:

pkg/scheduler/framework/plugins/dynamicresources
└─ TestPlugin/extended-resource-name-with-resources … FAIL
    expected ResourceClaim != nil

My patch only touches pkg/kubelet/prober/*, so I’m not sure whether this is:

  1. A side-effect of my branch being a bit behind master, or
  2. Something I need to fix directly.

Before I start rebasing and regenerating code, could you please confirm the expected next step?
Happy to follow any guidance.

Thanks!

@gjkim42 (Member) commented Aug 4, 2025

#133302
^
I think the test is flaky (and it succeeded this time, fortunately).

(FYI: We cannot merge a PR unless all required tests succeed, so we have to retest them if it looks like a flaky test issue.)

@AadiDev005 (Author)

@Random-Liu @tallclair @SergeyKanzhelev
All required tests are green and the fix is tagged priority/important-soon.
Could one of you please:
/milestone v1.34
/approve

@k8s-ci-robot (Contributor)

@AadiDev005: You must be a member of the kubernetes/milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Milestone Maintainers Team and have them propose you as an additional delegate for this responsibility.

In response to this:

@Random-Liu @tallclair @SergeyKanzhelev
All required tests are green and the fix is tagged priority/important-soon.
Could one of you please:
/milestone v1.34
/approve

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AadiDev005
Once this PR has been reviewed and has the lgtm label, please ask for approval from sergeykanzhelev. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gjkim42 (Member) commented Aug 4, 2025

The code freeze is already in effect and this doesn't look like an urgent fix, so let's merge this in the next cycle.

https://www.kubernetes.dev/resources/release/

Comment on lines 40 to 41
// newTestWorkerWithInitContainer creates a test worker with an init container setup
func newTestWorkerWithInitContainer(m *manager, probeType probeType, probe v1.Probe) *worker {
(Contributor)

nit: It might be better to clarify that this is specifically about a restartable init container.

Suggested change:
-// newTestWorkerWithInitContainer creates a test worker with an init container setup
-func newTestWorkerWithInitContainer(m *manager, probeType probeType, probe v1.Probe) *worker {
+// newTestWorkerWithRestartableInitContainer creates a test worker with an init container setup
+func newTestWorkerWithRestartableInitContainer(m *manager, probeType probeType, probe v1.Probe) *worker {

Image: "registry.k8s.io/pause:3.9",
}}

return &worker{
(Contributor)

Can we reuse the existing newWorker() here? That way, we won't need getInitialResult() anymore.

Something like this:

diff --git a/pkg/kubelet/prober/worker_test.go b/pkg/kubelet/prober/worker_test.go
index 1f6104179b1..d8ef9ae7462 100644
--- a/pkg/kubelet/prober/worker_test.go
+++ b/pkg/kubelet/prober/worker_test.go
@@ -38,7 +38,7 @@ func init() {
 }

 // newTestWorkerWithInitContainer creates a test worker with an init container setup
-func newTestWorkerWithInitContainer(m *manager, probeType probeType, probe v1.Probe) *worker {
+func newTestWorkerWithInitContainer(m *manager, probeType probeType) *worker {
        pod := getTestPod()

        // Set up init container with restart policy
@@ -50,19 +50,10 @@ func newTestWorkerWithInitContainer(m *manager, probeType probeType, probe v1.Pr
        // Move container to init containers and add a regular container
        pod.Spec.InitContainers = []v1.Container{initContainer}
        pod.Spec.Containers = []v1.Container{{
-               Name:  "main-container",
-               Image: "registry.k8s.io/pause:3.9",
+               Name: "main-container",
        }}

-       return &worker{
-               pod:            pod,
-               container:      initContainer,
-               probeType:      probeType,
-               spec:           &probe,
-               probeManager:   m,
-               resultsManager: resultsManager(m, probeType),
-               initialValue:   getInitialResult(probeType),
-       }
+       return newWorker(m, probeType, pod, initContainer)
 }

pod.Spec.InitContainers = []v1.Container{initContainer}
pod.Spec.Containers = []v1.Container{{
Name: "main-container",
Image: "registry.k8s.io/pause:3.9",
(Contributor)

nit: Since the image field isn't used in the test, it can be removed.

Suggested change:
-               Image: "registry.k8s.io/pause:3.9",

@@ -253,7 +263,8 @@ func (w *worker) doProbe(ctx context.Context) (keepGoing bool) {
		}
		// Abort if the container will not be restarted.
		return c.State.Terminated == nil ||
-			w.pod.Spec.RestartPolicy != v1.RestartPolicyNever
+			w.pod.Spec.RestartPolicy != v1.RestartPolicyNever ||
+			(w.isInitContainer() && w.container.RestartPolicy != nil && *w.container.RestartPolicy == v1.ContainerRestartPolicyAlways)
(Contributor)

nit: I don't feel strongly about this, but for readability, it might be worth giving this condition a name.

diff --git a/pkg/kubelet/prober/worker.go b/pkg/kubelet/prober/worker.go
index e71fd1b5687..4948cdb4236 100644
--- a/pkg/kubelet/prober/worker.go
+++ b/pkg/kubelet/prober/worker.go
@@ -261,10 +261,14 @@ func (w *worker) doProbe(ctx context.Context) (keepGoing bool) {
                if !w.containerID.IsEmpty() {
                        w.resultsManager.Set(w.containerID, results.Failure, w.pod)
                }
+
+               isRestartableInitContainer := w.isInitContainer() &&
+                       w.container.RestartPolicy != nil && *w.container.RestartPolicy == v1.ContainerRestartPolicyAlways
+
                // Abort if the container will not be restarted.
                return c.State.Terminated == nil ||
                        w.pod.Spec.RestartPolicy != v1.RestartPolicyNever ||
-                       (w.isInitContainer() && w.container.RestartPolicy != nil && *w.container.RestartPolicy == v1.ContainerRestartPolicyAlways)
+                       isRestartableInitContainer
        }

        // Graceful shutdown of the pod.

@@ -530,6 +572,72 @@ func TestResultRunOnStartupCheckFailure(t *testing.T) {
}
}

func TestDoProbe_TerminatedRestartableInitContainerWithRestartPolicyNone(t *testing.T) {
(Contributor)

nit:

Suggested change:
-func TestDoProbe_TerminatedRestartableInitContainerWithRestartPolicyNone(t *testing.T) {
+func TestDoProbe_TerminatedRestartableInitContainerWithRestartPolicyNever(t *testing.T) {

// Test: Terminated restartable init container with restartPolicy=Always should continue probing
// even when pod has restartPolicy=Never
if !w.doProbe(ctx) {
t.Error("Expected to continue probing for terminated restartable init container with restart policy Always")
(Contributor)

nit:

Suggested change:
-t.Error("Expected to continue probing for terminated restartable init container with restart policy Always")
+t.Error("Expected to continue probing for terminated restartable init container with pod restart policy Never")

}

// Verify result is set to Failure for terminated container
expectResult(t, w, results.Failure, "terminated restartable init container with restart policy Always")
(Contributor)

nit:

Suggested change:
-expectResult(t, w, results.Failure, "terminated restartable init container with restart policy Always")
+expectResult(t, w, results.Failure, "restartable init container with pod restart policy Never")

Fixes a bug where startup probe workers terminate incorrectly for sidecar
containers with restartPolicy=Always when the pod has restartPolicy=Never,
causing main containers to remain stuck in "Initializing" state.

Changes:
- Add container-level restart policy check for init containers only
- Extract complex boolean logic to named variable for readability
- Refactor test helper to use existing newWorker() function
- Fix terminology: RestartPolicyNone → RestartPolicyNever
- Add comprehensive unit and e2e tests for both scenarios
@AadiDev005 AadiDev005 force-pushed the fix-sidecar-startup-probe branch from bc4856a to 0a26f03 Compare August 5, 2025 06:42
@k8s-ci-robot removed the lgtm label on Aug 5, 2025
@k8s-ci-robot (Contributor)

New changes are detected. LGTM label has been removed.

@AadiDev005 (Author)

Hey @toVersus, thanks so much for the thorough review! 🙏

I've implemented all your suggestions and squashed everything into a clean single commit:

• Renamed the function to be clearer about restartable init containers
• Switched to using the existing newWorker() (great catch on the code duplication!)
• Cleaned up the unused image field and helper function
• Fixed the terminology from None to Never to match Kubernetes conventions
• Made that complex boolean condition much more readable with a proper variable name

Really appreciate you taking the time to make the code cleaner and more maintainable. The refactoring suggestions especially make a lot of sense - no need to reinvent the wheel when newWorker() already does what we need!
😊

@gjkim42, can you take a look when you have a moment?

@AadiDev005 AadiDev005 requested a review from toVersus August 5, 2025 06:47
@gjkim42 (Member) commented Aug 5, 2025

/test pull-kubernetes-node-e2e-containerd-features
/test pull-kubernetes-cos-cgroupv2-containerd-node-e2e-features
/test pull-kubernetes-cos-cgroupv1-containerd-node-e2e-features

@k8s-ci-robot (Contributor)

@AadiDev005: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-kubernetes-unit-windows-master · Commit: 0a26f03 · Required: false · Rerun command: /test pull-kubernetes-unit-windows-master

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@AadiDev005 (Author)

The Windows unit test failure in pull-kubernetes-unit-windows-master is unrelated to the sidecar startup probe fix, which only touches Linux kubelet probe logic in pkg/kubelet/prober/*. With 15/16 successful checks including all Linux-specific node tests, the sidecar fix is working correctly.

@SergeyKanzhelev @gjkim42 All code review feedback addressed and Linux tests green. Ready for approval when the freeze lifts?

Labels
area/kubelet, area/test, cncf-cla: yes, kind/bug, ok-to-test, priority/important-soon, release-note, sig/node, sig/testing, size/L, triage/accepted
Projects
Status: PRs Waiting on Author
Status: Waiting on Author
Development

Successfully merging this pull request may close these issues.

main container never starts with restartPolicy: Never and sidecar container has a startupProbe and fails the first attempt
5 participants