Job controller's race condition - Pod finalizer removal and job uncounted status update should work in separate reconcile #130103
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.
/sig apps
The following unit test emulates and reproduces the situation in which the pod leakage happens:

```go
func TestJobControllerRaceCondition(t *testing.T) {
	t.Cleanup(setDurationDuringTest(&syncJobBatchPeriod, fastSyncJobBatchPeriod))
	logger, ctx := ktesting.NewTestContext(t)
	job1 := newJob(1, 1, 6, batch.NonIndexedCompletion)
	job1.Name = "job1"
	clientSet := fake.NewSimpleClientset(job1)
	fakeClock := clocktesting.NewFakeClock(time.Now())
	jm, informer := newControllerFromClientWithClock(ctx, t, clientSet, controller.NoResyncPeriodFunc, fakeClock)
	jm.podControl = &controller.RealPodControl{
		KubeClient: clientSet,
		Recorder:   testutil.NewFakeRecorder(),
	}
	jm.podStoreSynced = alwaysReady
	jm.jobStoreSynced = alwaysReady
	informer.Batch().V1().Jobs().Informer().GetIndexer().Add(job1)

	// 1st reconcile should create a new pod.
	err := jm.syncJob(ctx, testutil.GetKey(job1, t))
	assert.NoError(t, err)
	time.Sleep(time.Second)

	podIndexer := informer.Core().V1().Pods().Informer().GetIndexer()
	assert.NotNil(t, podIndexer)
	podList, err := clientSet.Tracker().List(
		schema.GroupVersionResource{Version: "v1", Resource: "pods"},
		schema.GroupVersionKind{Version: "v1", Kind: "Pod"},
		"default")
	assert.NoError(t, err)
	assert.NotNil(t, podList)

	// Manually add the just-created pod from the fake clientset to the informer
	// cache because the informer is not started; mark it succeeded and drop its
	// finalizer. Meanwhile, intentionally do NOT add the Job status update to the
	// informer cache, to simulate a delayed Job watch event.
	justCreatedPod := podList.(*v1.PodList).Items[0]
	justCreatedPod.Finalizers = nil
	justCreatedPod.Status.Phase = v1.PodSucceeded
	podIndexer.Add(&justCreatedPod)
	jm.addPod(logger, &justCreatedPod)

	// Remove the just-created pod from the fake clientset, but leave it in the
	// informer cache.
	clientSet.Tracker().Delete(
		schema.GroupVersionResource{Version: "v1", Resource: "pods"},
		"default", justCreatedPod.Name)

	// 2nd reconcile.
	err = jm.syncJob(ctx, testutil.GetKey(job1, t))
	assert.NoError(t, err)
	time.Sleep(time.Second)

	podList, err = clientSet.Tracker().List(
		schema.GroupVersionResource{Version: "v1", Resource: "pods"},
		schema.GroupVersionKind{Version: "v1", Kind: "Pod"},
		"default")
	assert.NoError(t, err)

	// No pod should be created, but this assertion fails: the controller has
	// created a second pod.
	assert.Empty(t, podList.(*v1.PodList).Items)
}
```
cc @kmala
What happened?
Job controller accidentally created two pods in a busy cluster, even though the Job spec specifies a single completion (completions: 1).
This is happening because when the job controller calculates the succeeded pods, it takes three inputs (here):

1. the finished pods in the pod informer cache that still carry the job tracking finalizer
2. Job.status.succeeded
3. Job.status.uncountedTerminatedPods

Because the current implementation refreshes both (2) and (3) in the same reconcile/sync process, the job controller may miscount the succeeded pods when the watch event for (3) has not yet reached the controller. A delayed watch event is fairly likely in a busy cluster.
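For illustration, here is a minimal sketch of how the three inputs combine. This is not the actual controller code; countSucceeded and hasJobTrackingFinalizer are hypothetical names standing in for the logic in syncJob. It shows why a stale cached Job plus a pod whose finalizer was already removed makes all three inputs report zero, so the controller creates a replacement pod:

```go
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// hasJobTrackingFinalizer reports whether the pod still carries the
// batch.kubernetes.io/job-tracking finalizer (hypothetical helper).
func hasJobTrackingFinalizer(p *corev1.Pod) bool {
	for _, f := range p.Finalizers {
		if f == batchv1.JobTrackingFinalizer {
			return true
		}
	}
	return false
}

// countSucceeded combines the three inputs; job comes from the job informer
// cache, pods from the pod informer cache.
func countSucceeded(job *batchv1.Job, pods []*corev1.Pod) int32 {
	// (2) pods already counted into the Job status.
	succeeded := job.Status.Succeeded

	// (3) pods recorded as terminated but not yet counted. This comes from the
	// cached Job object, so a delayed Job watch event leaves it empty/stale.
	if u := job.Status.UncountedTerminatedPods; u != nil {
		succeeded += int32(len(u.Succeeded))
	}

	// (1) newly finished pods that still carry the tracking finalizer. A pod
	// whose finalizer was already removed in a previous sync is not counted.
	for _, p := range pods {
		if p.Status.Phase == corev1.PodSucceeded && hasJobTrackingFinalizer(p) {
			succeeded++
		}
	}
	return succeeded
}
```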
What did you expect to happen?
A better implementation would handle the refreshing of (2) and (3) in separate reconcile processes (see the sketch below):

1. In the 1st reconcile, the job controller should only refresh Job.status.uncountedTerminatedPods while preserving the finalizers on the pods.
2. In the 2nd reconcile, which is triggered by (1), the job controller is safe to remove the finalizers from the pods.
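A minimal sketch of this proposed two-phase flow, under the assumption above; the function names recordUncounted and flushAndRemoveFinalizers are hypothetical and this is not a patch against the real controller:

```go
package sketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// Phase 1 (1st reconcile): record newly finished pods in
// Job.status.uncountedTerminatedPods, but keep their tracking finalizers so
// they still count as succeeded even while the Job status watch event is in
// flight. Persisting the status triggers the next reconcile.
func recordUncounted(job *batchv1.Job, finished []*corev1.Pod) {
	if job.Status.UncountedTerminatedPods == nil {
		job.Status.UncountedTerminatedPods = &batchv1.UncountedTerminatedPods{}
	}
	for _, p := range finished {
		job.Status.UncountedTerminatedPods.Succeeded =
			append(job.Status.UncountedTerminatedPods.Succeeded, p.UID)
	}
	// ...persist via UpdateStatus against the API server; pod finalizers are
	// NOT touched in this reconcile.
}

// Phase 2 (2nd reconcile, triggered by the status update above): the uncounted
// entries are now durable, so removing the finalizers can no longer make the
// pods disappear from the succeeded count.
func flushAndRemoveFinalizers(job *batchv1.Job, finished []*corev1.Pod) {
	// ...remove the tracking finalizer from each pod via the API server, then
	// move the UIDs from uncountedTerminatedPods into Job.status.succeeded.
}
```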
How can we reproduce it (as minimally and precisely as possible)?
Any trivial job with 1 completion on a busy cluster should be able to reproduce the issue
Anything else we need to know?
No response
Kubernetes version
All k8s versions I assume
Cloud provider
Any
OS version
Any
Install tools
No response
Container runtime (CRI) and version (if applicable)
No response
Related plugins (CNI, CSI, ...) and versions (if applicable)
No response