
Job controller's race condition - Pod finalizer removal and job uncounted status update should work in separate reconcile #130103

Open
yue9944882 opened this issue Feb 12, 2025 · 5 comments
Labels
kind/bug · needs-triage · sig/apps

Comments

@yue9944882
Member

What happened?

The Job controller accidentally created two pods in a busy cluster, even though the Job spec specifies:

  parallelism: 1
  completions: 1
  activeDeadlineSeconds: 86400
  backoffLimit: 0

This happens because, when the Job controller calculates the succeeded pods, it takes three inputs (here):

  1. The completion count from Job.status.succeeded
  2. The existing succeeded pods (if the finalizer is still present)
  3. The uncounted completion count from Job.status.uncountedTerminatedPods

Because the current implementation refreshes both (2) and (3) in the same reconcile/sync pass:

https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/job/job_controller.go#L1165-L1167

the Job controller may miscount the succeeded pods when the watch event for (3) hasn't reached it yet. A delayed watch event is fairly likely in a busy cluster.
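
For illustration only, here is a stripped-down model of how those three inputs add up; the function and variable names are made up and this is not the controller's actual code:

package main

import "fmt"

// expectedSucceeded is an illustrative reduction of the bookkeeping in
// pkg/controller/job/job_controller.go: completions already persisted in
// Job.status.succeeded (1), succeeded pods still carrying the job-tracking
// finalizer (2), and UIDs parked in Job.status.uncountedTerminatedPods (3).
func expectedSucceeded(statusSucceeded, podsWithFinalizer, uncounted int) int {
	return statusSucceeded + podsWithFinalizer + uncounted
}

func main() {
	// Normal case: the pod still has the finalizer, so it is visible via (2).
	fmt.Println(expectedSucceeded(0, 1, 0)) // 1 -> no replacement pod needed

	// Race: an earlier sync already stripped the finalizer and wrote the pod
	// into uncountedTerminatedPods, but that Job status update has not reached
	// the controller's informer cache yet. The cached Job shows neither (1)
	// nor (3), and the pod no longer qualifies for (2).
	fmt.Println(expectedSucceeded(0, 0, 0)) // 0 -> controller creates a second pod
}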

What did you expect to happen?

A better implementation would handle the refresh of (2) and (3) in separate reconcile passes:

(1) In the 1st reconcile, the Job controller should only refresh Job.status.uncountedTerminatedPods while preserving the finalizers on the pods.
(2) In the 2nd reconcile, which is triggered by (1), the Job controller can safely remove the finalizers from the pods.
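
Below is a minimal, self-contained sketch of that ordering. The Pod and Job types and the helper logic are hypothetical stand-ins, not the real API objects or the controller's code; they only illustrate why splitting the two steps closes the race:

package main

import "fmt"

// Pod and Job are stand-ins for the real API types.
type Pod struct {
	UID          string
	HasFinalizer bool
}

type Job struct {
	UncountedSucceeded map[string]bool // mirrors status.uncountedTerminatedPods.succeeded
	Succeeded          int             // mirrors status.succeeded
}

// reconcile performs one sync pass.
// Phase 1 only records the succeeded pod UIDs in the Job status and returns;
// the resulting Job update event re-enqueues the Job.
// Phase 2 runs once every succeeded pod is already durable in the Job status,
// so removing the finalizers can no longer make a completion invisible, even
// if a watch event is delayed.
func reconcile(job *Job, succeededPods []*Pod) {
	allRecorded := true
	for _, p := range succeededPods {
		if !job.UncountedSucceeded[p.UID] {
			allRecorded = false
		}
	}
	if !allRecorded {
		for _, p := range succeededPods {
			job.UncountedSucceeded[p.UID] = true // finalizers stay on
		}
		fmt.Println("phase 1: recorded uncounted pods, re-enqueue job")
		return
	}
	for _, p := range succeededPods {
		p.HasFinalizer = false
		delete(job.UncountedSucceeded, p.UID)
		job.Succeeded++
	}
	fmt.Println("phase 2: removed finalizers, counted completions")
}

func main() {
	job := &Job{UncountedSucceeded: map[string]bool{}}
	pod := &Pod{UID: "pod-a", HasFinalizer: true}
	reconcile(job, []*Pod{pod}) // 1st reconcile: status update only
	reconcile(job, []*Pod{pod}) // 2nd reconcile: safe to strip the finalizer
	fmt.Println("succeeded:", job.Succeeded, "finalizer left:", pod.HasFinalizer)
}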

How can we reproduce it (as minimally and precisely as possible)?

Any trivial Job with 1 completion on a busy cluster should be able to reproduce the issue.
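
As a starting point, here is a minimal client-go program that submits such a trivial Job; it assumes a reachable cluster via the default kubeconfig, and the name, namespace, and image are arbitrary:

package main

import (
	"context"
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/utils/ptr"
)

func main() {
	// Load the default kubeconfig (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Same shape as the spec above: 1 completion, no retries.
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "race-repro", Namespace: "default"},
		Spec: batchv1.JobSpec{
			Parallelism:           ptr.To[int32](1),
			Completions:           ptr.To[int32](1),
			ActiveDeadlineSeconds: ptr.To[int64](86400),
			BackoffLimit:          ptr.To[int32](0),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "main",
						Image:   "busybox",
						Command: []string{"true"},
					}},
				},
			},
		},
	}
	created, err := client.BatchV1().Jobs("default").Create(context.TODO(), job, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created job", created.Name)
	// In a busy cluster, check afterwards whether more than one pod was created
	// for this Job even though it only needs a single completion.
}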

Anything else we need to know?

No response

Kubernetes version

All k8s versions I assume

Cloud provider

Any

OS version

Any

Install tools

No response

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

@yue9944882 yue9944882 added the kind/bug label Feb 12, 2025
@k8s-ci-robot k8s-ci-robot added the needs-sig label Feb 12, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage label Feb 12, 2025
@yue9944882
Member Author

/sig apps

@k8s-ci-robot k8s-ci-robot added the sig/apps label and removed the needs-sig label Feb 12, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Feb 12, 2025
@yue9944882
Member Author

yue9944882 commented Feb 12, 2025

The following unit test emulates and reproduces the pod leak:

func TestJobControllerRaceCondition(t *testing.T) {
	t.Cleanup(setDurationDuringTest(&syncJobBatchPeriod, fastSyncJobBatchPeriod))
	logger, ctx := ktesting.NewTestContext(t)
	job1 := newJob(1, 1, 6, batch.NonIndexedCompletion)
	job1.Name = "job1"
	clientSet := fake.NewSimpleClientset(job1)
	//clientset := clientset.NewForConfigOrDie(&restclient.Config{Host: "", ContentConfig: restclient.ContentConfig{GroupVersion: &schema.GroupVersion{Group: "", Version: "v1"}}})
	fakeClock := clocktesting.NewFakeClock(time.Now())
	jm, informer := newControllerFromClientWithClock(ctx, t, clientSet, controller.NoResyncPeriodFunc, fakeClock)
	jm.podControl = &controller.RealPodControl{
		KubeClient: clientSet,
		Recorder:   testutil.NewFakeRecorder(),
	}
	jm.podStoreSynced = alwaysReady
	jm.jobStoreSynced = alwaysReady
 
	informer.Batch().V1().Jobs().Informer().GetIndexer().Add(job1)
	// 1st reconcile should create a new pod
	err := jm.syncJob(ctx, testutil.GetKey(job1, t))
	assert.NoError(t, err)
	time.Sleep(time.Second)
 
	podIndexer := informer.Core().V1().Pods().Informer().GetIndexer()
	assert.NotNil(t, podIndexer)
	podList, err := clientSet.Tracker().List(
		schema.GroupVersionResource{Version: "v1", Resource: "pods"},
		schema.GroupVersionKind{Version: "v1", Kind: "Pod"},
		"default")
	assert.NoError(t, err)
	assert.NotNil(t, podList)
	// Manually add the just-created pod from the fake clientset's memory to the
	// informer cache, because the informer is not started. Intentionally do NOT
	// add the corresponding job status update to the informer cache, to simulate
	// a delayed job watch event.
	justCreatedPod := podList.(*v1.PodList).Items[0]
	justCreatedPod.Finalizers = nil
	justCreatedPod.Status.Phase = v1.PodSucceeded
	podIndexer.Add(&justCreatedPod)
	jm.addPod(logger, &justCreatedPod)
	
	// Remove the just-created pod from the fake clientset's memory; the pod
	// stays in the informer cache.
	clientSet.Tracker().Delete(
		schema.GroupVersionResource{Version: "v1", Resource: "pods"},
		"default", justCreatedPod.Name)
	// 2nd reconcile
	err = jm.syncJob(ctx, testutil.GetKey(job1, t))
	assert.NoError(t, err)
	time.Sleep(time.Second)
	podList, err = clientSet.Tracker().List(
		schema.GroupVersionResource{Version: "v1", Resource: "pods"},
		schema.GroupVersionKind{Version: "v1", Kind: "Pod"},
		"default")
	assert.NoError(t, err)

	// no pod should be created, but the test fails here with one extra pod
	assert.Empty(t, podList.(*v1.PodList).Items)
}
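
Assuming the test is placed in pkg/controller/job so that the helpers it uses (newControllerFromClientWithClock, alwaysReady, testutil, and so on) resolve, running go test -run TestJobControllerRaceCondition should end with the final assertion failing, because the controller creates a replacement pod even though the only required completion already succeeded.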

@jqmichael
Contributor

CC: @liggitt @dims

@liggitt liggitt added this to @liggitt Feb 12, 2025
@yue9944882
Member Author

cc @kmala
