
Job controller's race condition - Pod finalizer removal and job uncounted status update should work in separate reconcile #130103

Open
yue9944882 opened this issue Feb 12, 2025 · 5 comments
Labels
kind/bug · needs-triage · sig/apps

Comments

@yue9944882
Member

What happened?

The Job controller accidentally created two pods in a busy cluster, even though the Job spec specifies:

  parallelism: 1
  completions: 1
  activeDeadlineSeconds: 86400
  backoffLimit: 0

This happens because, when the Job controller calculates the succeeded pods, it takes three inputs (here):

  1. The completion count from Job.status.succeeded
  2. The existing succeeded pods (if the finalizer is still present)
  3. The uncounted completion count from Job.status.uncountedTerminatedPods

Because the current implementation refreshes both (2) and (3) in the same reconcile/sync pass:

https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller.go#L1165-L1167

the Job controller may miscount the succeeded pods when the watch event for (3) hasn't reached it yet. A delayed watch event is fairly likely in a busy cluster.
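
For illustration only, here is a stripped-down model of how those three inputs add up; the function and variable names are made up and this is not the controller's actual code:

package main

import "fmt"

// expectedSucceeded is an illustrative reduction of the bookkeeping in
// pkg/controller/job/job_controller.go: completions already persisted in
// Job.status.succeeded (1), succeeded pods still carrying the job-tracking
// finalizer (2), and UIDs parked in Job.status.uncountedTerminatedPods (3).
func expectedSucceeded(statusSucceeded, podsWithFinalizer, uncounted int) int {
	return statusSucceeded + podsWithFinalizer + uncounted
}

func main() {
	// Normal case: the pod still has the finalizer, so it is visible via (2).
	fmt.Println(expectedSucceeded(0, 1, 0)) // 1 -> no replacement pod needed

	// Race: an earlier sync already stripped the finalizer and wrote the pod
	// into uncountedTerminatedPods, but that Job status update has not reached
	// the controller's informer cache yet. The cached Job shows neither (1)
	// nor (3), and the pod no longer qualifies for (2).
	fmt.Println(expectedSucceeded(0, 0, 0)) // 0 -> controller creates a second pod
}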

What did you expect to happen?

A better implementation would handle the refresh of (2) and (3) in separate reconcile passes:

(1) In the 1st reconcile, the Job controller should only refresh Job.status.uncountedTerminatedPods while preserving the finalizers on the pods.
(2) In the 2nd reconcile, which is triggered by (1), the Job controller can safely remove the finalizers from the pods.
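
Below is a minimal, self-contained sketch of that ordering. The Pod and Job types and the helper logic are hypothetical stand-ins, not the real API objects or the controller's code; they only illustrate why splitting the two steps closes the race:

package main

import "fmt"

// Pod and Job are stand-ins for the real API types.
type Pod struct {
	UID          string
	HasFinalizer bool
}

type Job struct {
	UncountedSucceeded map[string]bool // mirrors status.uncountedTerminatedPods.succeeded
	Succeeded          int             // mirrors status.succeeded
}

// reconcile performs one sync pass.
// Phase 1 only records the succeeded pod UIDs in the Job status and returns;
// the resulting Job update event re-enqueues the Job.
// Phase 2 runs once every succeeded pod is already durable in the Job status,
// so removing the finalizers can no longer make a completion invisible, even
// if a watch event is delayed.
func reconcile(job *Job, succeededPods []*Pod) {
	allRecorded := true
	for _, p := range succeededPods {
		if !job.UncountedSucceeded[p.UID] {
			allRecorded = false
		}
	}
	if !allRecorded {
		for _, p := range succeededPods {
			job.UncountedSucceeded[p.UID] = true // finalizers stay on
		}
		fmt.Println("phase 1: recorded uncounted pods, re-enqueue job")
		return
	}
	for _, p := range succeededPods {
		p.HasFinalizer = false
		delete(job.UncountedSucceeded, p.UID)
		job.Succeeded++
	}
	fmt.Println("phase 2: removed finalizers, counted completions")
}

func main() {
	job := &Job{UncountedSucceeded: map[string]bool{}}
	pod := &Pod{UID: "pod-a", HasFinalizer: true}
	reconcile(job, []*Pod{pod}) // 1st reconcile: status update only
	reconcile(job, []*Pod{pod}) // 2nd reconcile: safe to strip the finalizer
	fmt.Println("succeeded:", job.Succeeded, "finalizer left:", pod.HasFinalizer)
}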

How can we reproduce it (as minimally and precisely as possible)?

Any trivial Job with 1 completion on a busy cluster should be able to reproduce the issue.
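
As a starting point, here is a minimal client-go program that submits such a trivial Job; it assumes a reachable cluster via the default kubeconfig, and the name, namespace, and image are arbitrary:

package main

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/utils/ptr"
)

func main() {
	// Load the default kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Same shape as the spec above: 1 completion, no retries.
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "race-repro", Namespace: "default"},
		Spec: batchv1.JobSpec{
			Parallelism:           ptr.To[int32](1),
			Completions:           ptr.To[int32](1),
			ActiveDeadlineSeconds: ptr.To[int64](86400),
			BackoffLimit:          ptr.To[int32](0),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "main",
						Image:   "busybox",
						Command: []string{"true"},
					}},
				},
			},
		},
	}
	created, err := client.BatchV1().Jobs("default").Create(context.TODO(), job, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created job", created.Name)
	// In a busy cluster, check afterwards whether more than one pod was created
	// for this Job even though it only needs a single completion.
}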

Anything else we need to know?

No response

Kubernetes version

All k8s versions I assume

Cloud provider

Any

OS version

Any

Install tools

No response

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

@yue9944882 yue9944882 added the kind/bug label Feb 12, 2025
@k8s-ci-robot k8s-ci-robot added the needs-sig label Feb 12, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage label Feb 12, 2025
@yue9944882
Member Author

/sig apps

@k8s-ci-robot k8s-ci-robot added the sig/apps label and removed the needs-sig label Feb 12, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Feb 12, 2025
@yue9944882
Member Author

yue9944882 commented Feb 12, 2025

The following unit test emulates and reproduces the pod leak:

func TestJobControllerRaceCondition(t *testing.T) {
	t.Cleanup(setDurationDuringTest(&syncJobBatchPeriod, fastSyncJobBatchPeriod))
	logger, ctx := ktesting.NewTestContext(t)
	job1 := newJob(1, 1, 6, batch.NonIndexedCompletion)
	job1.Name = "job1"
	clientSet := fake.NewSimpleClientset(job1)
	//clientset := clientset.NewForConfigOrDie(&restclient.Config{Host: "", ContentConfig: restclient.ContentConfig{GroupVersion: &schema.GroupVersion{Group: "", Version: "v1"}}})
	fakeClock := clocktesting.NewFakeClock(time.Now())
	jm, informer := newControllerFromClientWithClock(ctx, t, clientSet, controller.NoResyncPeriodFunc, fakeClock)
	jm.podControl = &controller.RealPodControl{
		KubeClient: clientSet,
		Recorder:   testutil.NewFakeRecorder(),
	}
	jm.podStoreSynced = alwaysReady
	jm.jobStoreSynced = alwaysReady
 
	informer.Batch().V1().Jobs().Informer().GetIndexer().Add(job1)
	// 1st reconcile should create a new pod
	err := jm.syncJob(ctx, testutil.GetKey(job1, t))
	assert.NoError(t, err)
	time.Sleep(time.Second)
 
	podIndexer := informer.Core().V1().Pods().Informer().GetIndexer()
	assert.NotNil(t, podIndexer)
	podList, err := clientSet.Tracker().List(
		schema.GroupVersionResource{Version: "v1", Resource: "pods"},
		schema.GroupVersionKind{Version: "v1", Kind: "Pod"},
		"default")
	assert.NoError(t, err)
	assert.NotNil(t, podList)
	// Manually add the just-created pod from the fake clientset's memory to the
	// informer cache, because the informer is not started. Intentionally do NOT
	// add the corresponding job status update to the informer cache, to simulate
	// a delayed job watch event.
	justCreatedPod := podList.(*v1.PodList).Items[0]
	justCreatedPod.Finalizers = nil
	justCreatedPod.Status.Phase = v1.PodSucceeded
	podIndexer.Add(&justCreatedPod)
	jm.addPod(logger, &justCreatedPod)
	
	// Remove the just-created pod from the fake clientset's memory; the pod
	// stays in the informer cache.
	clientSet.Tracker().Delete(
		schema.GroupVersionResource{Version: "v1", Resource: "pods"},
		"default", justCreatedPod.Name)
	// 2nd reconcile
	err = jm.syncJob(ctx, testutil.GetKey(job1, t))
	assert.NoError(t, err)
	time.Sleep(time.Second)
	podList, err = clientSet.Tracker().List(
		schema.GroupVersionResource{Version: "v1", Resource: "pods"},
		schema.GroupVersionKind{Version: "v1", Kind: "Pod"},
		"default")
	assert.NoError(t, err)

	// no pod should be created, but the test fails here with one extra pod
	assert.Empty(t, podList.(*v1.PodList).Items)
}
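
Assuming the test is placed in pkg/controller/job so that the helpers it uses (newControllerFromClientWithClock, alwaysReady, testutil, and so on) resolve, running go test -run TestJobControllerRaceCondition should end with the final assertion failing, because the controller creates a replacement pod even though the only required completion already succeeded.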

@jqmichael
Contributor

CC: @liggitt @dims

@liggitt liggitt added this to @liggitt Feb 12, 2025
@yue9944882
Member Author

cc @kmala
