fix: data race in UnschedulablePods by adding lock #132043


Closed

Conversation

googs1025
Member

What type of PR is this?

/kind bug

What this PR does / why we need it:

  • fix data race in UnschedulablePods by adding lock

Which issue(s) this PR fixes:

Fixes #132025

Special notes for your reviewer:

Does this PR introduce a user-facing change?

None

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

None

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 31, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label May 31, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: googs1025
Once this PR has been reviewed and has the lgtm label, please assign macsko for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from dom4ha and macsko May 31, 2025 01:18
@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 31, 2025
@googs1025
Member Author

handleSchedulingFailure may be called from handleBindingCycleError, which runs in another goroutine (the binding cycle), via sched.FailureHandler:

func (sched *Scheduler) handleBindingCycleError(
	ctx context.Context,
	state fwk.CycleState,
	fwk framework.Framework,
	podInfo *framework.QueuedPodInfo,
	start time.Time,
	scheduleResult ScheduleResult,
	status *framework.Status) {
	logger := klog.FromContext(ctx)
	assumedPod := podInfo.Pod
	// trigger un-reserve plugins to clean up state associated with the reserved Pod
	fwk.RunReservePluginsUnreserve(ctx, state, assumedPod, scheduleResult.SuggestedHost)
	if forgetErr := sched.Cache.ForgetPod(logger, assumedPod); forgetErr != nil {
		logger.Error(forgetErr, "scheduler cache ForgetPod failed")
	} else {
		// "Forget"ing an assumed Pod in binding cycle should be treated as a PodDelete event,
		// as the assumed Pod had occupied a certain amount of resources in scheduler cache.
		//
		// Avoid moving the assumed Pod itself as it's always Unschedulable.
		// It's intentional to "defer" this operation; otherwise MoveAllToActiveOrBackoffQueue() would
		// add this event to in-flight events and thus move the assumed pod to backoffQ anyways if the plugins don't have appropriate QueueingHint.
		if status.IsRejected() {
			defer sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(logger, framework.EventAssignedPodDelete, assumedPod, nil, func(pod *v1.Pod) bool {
				return assumedPod.UID != pod.UID
			})
		} else {
			sched.SchedulingQueue.MoveAllToActiveOrBackoffQueue(logger, framework.EventAssignedPodDelete, assumedPod, nil, nil)
		}
	}
	sched.FailureHandler(ctx, fwk, podInfo, status, clearNominatedNode, start)
}

@@ -1363,6 +1363,8 @@ func (p *PriorityQueue) newQueuedPodInfo(pod *v1.Pod, plugins ...string) *framew
// UnschedulablePods holds pods that cannot be scheduled. This data structure
// is used to implement unschedulablePods.
type UnschedulablePods struct {
+	// lock synchronizes access to podInfoMap and ensures thread-safe operations.
Member Author

@googs1025 googs1025 May 31, 2025

My understanding is that, like activeQ and backoffQ, we need to use a read-write lock underneath 🤔
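
A minimal, self-contained sketch of that idea, using simplified stand-in types instead of the real framework.QueuedPodInfo and the metrics recorder fields: every access to podInfoMap goes through a sync.RWMutex, which is essentially what this PR adds to UnschedulablePods.

package main

import "sync"

// queuedPodInfo is a simplified stand-in for the scheduler's QueuedPodInfo.
type queuedPodInfo struct {
	key string // e.g. "namespace/name"
}

// unschedulablePods is a simplified stand-in for UnschedulablePods with the
// proposed lock: all reads and writes of podInfoMap are serialized by lock.
type unschedulablePods struct {
	lock       sync.RWMutex
	podInfoMap map[string]*queuedPodInfo
}

func (u *unschedulablePods) addOrUpdate(p *queuedPodInfo) {
	u.lock.Lock()
	defer u.lock.Unlock()
	u.podInfoMap[p.key] = p
}

func (u *unschedulablePods) get(key string) *queuedPodInfo {
	u.lock.RLock()
	defer u.lock.RUnlock()
	return u.podInfoMap[key]
}

func (u *unschedulablePods) delete(key string) {
	u.lock.Lock()
	defer u.lock.Unlock()
	delete(u.podInfoMap, key)
}

func main() {
	u := &unschedulablePods{podInfoMap: map[string]*queuedPodInfo{}}
	u.addOrUpdate(&queuedPodInfo{key: "default/nginx"})
	_ = u.get("default/nginx")
	u.delete("default/nginx")
}

As the rest of the thread shows, this alone did not silence the race detector, because the conflicting accesses do not both go through UnschedulablePods.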

@macsko
Member

macsko commented Jun 2, 2025

Have you tested if this change fixes the race?

@googs1025 googs1025 force-pushed the fix/scheduler/data_race branch 2 times, most recently from 55a771e to 66d4f4c on June 2, 2025 11:53
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 2, 2025
@googs1025
Member Author

Thanks, @macsko.

I used GOFLAGS="-race" make test-integration WHAT=test/integration/scheduler/preemption KUBE_TEST_ARGS="-test.run TestPreemption/basic_pod_preemption_with_preFilter_" to test this change, and it seems that the data race still occurs.

If I understand correctly, adding a read-write lock should solve this problem, but it does not. Could you give me some suggestions for solving it? 🤔

@googs1025 googs1025 force-pushed the fix/scheduler/data_race branch from 66d4f4c to 8feedbb on June 2, 2025 12:17
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 2, 2025
@macsko
Member

macsko commented Jun 2, 2025

I see the race might be different. It's between this line:

podInfo.PodInfo, _ = framework.NewPodInfo(cachedPod.DeepCopy())

And this:
p.unschedulablePods.delete(pInfo.Pod, gated)

Precisely, it's a race on writing to podInfo.PodInfo in the first and reading pInfo.Pod in the second. It's interesting, because the first writes to a pod info that can't be in a scheduling queue in the second. Probably, it's a very unlikely race when:

  1. Scheduler receives pod update and calls SchedulingQueue.Update() with it
  2. Pod is popped from the scheduling queue (either activeQ or backoffQ)
  3. Pod is unschedulable and goes to handleSchedulingFailure (first - schedule_one.go#L1069), but at the same time SchedulingQueue.Update() is on scheduling_queue.go#L1020 - RACE

You could check if this is the case. But if it is, I'm not sure how we could fix this.
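
A standalone sketch of that interleaving, using hypothetical stand-in types instead of the scheduler's QueuedPodInfo/PodInfo: one goroutine replaces the embedded pod-info pointer (as the failure path does) while another reads through it (as unschedulablePods.delete(pInfo.Pod, gated) does). Running it with go run -race reports a race on the shared pointer field.

package main

import "sync"

type pod struct{ name string }

// podInfo stands in for framework.PodInfo; queuedPodInfo embeds a pointer to
// it, mirroring how QueuedPodInfo embeds *framework.PodInfo.
type podInfo struct{ Pod *pod }

type queuedPodInfo struct{ *podInfo }

func main() {
	qp := &queuedPodInfo{podInfo: &podInfo{Pod: &pod{name: "p"}}}

	var wg sync.WaitGroup
	wg.Add(2)

	// Like the failure path: replace the embedded pod info with a freshly
	// built one (podInfo.PodInfo, _ = framework.NewPodInfo(...)).
	go func() {
		defer wg.Done()
		qp.podInfo = &podInfo{Pod: &pod{name: "p-updated"}}
	}()

	// Like Update(): read the pod through the embedded pointer
	// (p.unschedulablePods.delete(pInfo.Pod, gated)).
	go func() {
		defer wg.Done()
		_ = qp.Pod // unsynchronized read of the same field the other goroutine writes
	}()

	wg.Wait()
}

The detector points at the write of the embedded pointer in the first goroutine and the promoted-field read in the second, which matches the shape of the two accesses quoted above.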

@googs1025 googs1025 force-pushed the fix/scheduler/data_race branch from 8feedbb to f16f0b2 on June 2, 2025 13:48
Signed-off-by: googs1025 <googs1025@gmail.com>
@googs1025 googs1025 force-pushed the fix/scheduler/data_race branch from f16f0b2 to 760b454 on June 2, 2025 14:01
@@ -1017,7 +1017,7 @@ func (p *PriorityQueue) Update(logger klog.Logger, oldPod, newPod *v1.Pod) {
queue := p.requeuePodWithQueueingStrategy(logger, pInfo, hint, evt.Label())
if queue != unschedulablePods {
logger.V(5).Info("Pod moved to an internal scheduling queue because the Pod is updated", "pod", klog.KObj(newPod), "event", evt.Label(), "queue", queue)
-p.unschedulablePods.delete(pInfo.Pod, gated)
+p.unschedulablePods.delete(pInfo.GetPodCopy(), gated)
Member Author

This seems a bit strange; we already use DeepCopy, which I thought should avoid the data race 🤔

Member

@macsko macsko Jun 3, 2025

The race is on access to the pInfo.Pod field. If you try to copy it, you still have to reference (read) it, and that read is the racing access.

Member

If so, using pInfo.GetPodCopy() cannot avoid the data race either, since it accesses pInfo.Pod as well?

func (pi *PodInfo) GetPodCopy() *v1.Pod {
	if pi.Pod == nil {
		return nil
	}
	return pi.Pod.DeepCopy() // <-- access to pi.Pod
}

Member

Yes

@googs1025
Member Author

Precisely, it's a race on writing to podInfo.PodInfo in the first and reading pInfo.Pod in the second. It's interesting, because the first writes to a pod info that can't be in a scheduling queue in the second. Probably, it's a very unlikely race when:

Scheduler receives pod update and calls SchedulingQueue.Update() with it
Pod is popped from the scheduling queue (either activeQ or backoffQ)
Pod is unschedulable and goes to handleSchedulingFailure (first - schedule_one.go#L1069), but in the same time SchedulingQueue.Update() is on scheduling_queue.go#L1020 - RACE

Looking at the code, it seems to be the reason you described 🤔

@likakuli
Member

It seems that these two lines have multiple data races, including reads and writes to podInfo.PodInfo, reads and writes to the podInfo.Pod itself, as well as reads and writes to the pod's properties.
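
If that's the case, here is a sketch of why a lock confined to UnschedulablePods cannot help (same hypothetical stand-in types as in the earlier sketches): the writer mutates the shared QueuedPodInfo directly in the failure path and never calls into UnschedulablePods, so it never takes the lock; the reader's access to pInfo.Pod also happens while evaluating the argument, before delete() locks anything.

package main

import "sync"

type pod struct{ name string }

type podInfo struct{ Pod *pod }

type queuedPodInfo struct{ *podInfo }

// unschedulablePods guards its map with a lock, as proposed in this PR.
type unschedulablePods struct {
	lock       sync.RWMutex
	podInfoMap map[string]*queuedPodInfo
}

func (u *unschedulablePods) delete(p *pod) {
	u.lock.Lock()
	defer u.lock.Unlock()
	delete(u.podInfoMap, p.name)
}

func main() {
	qp := &queuedPodInfo{podInfo: &podInfo{Pod: &pod{name: "p"}}}
	u := &unschedulablePods{podInfoMap: map[string]*queuedPodInfo{"p": qp}}

	var wg sync.WaitGroup
	wg.Add(2)

	// Failure path: mutates the QueuedPodInfo directly; u.lock is never taken here.
	go func() {
		defer wg.Done()
		qp.podInfo = &podInfo{Pod: &pod{name: "p-updated"}}
	}()

	// Update() path: qp.Pod is read to build delete's argument before the lock
	// inside delete() is acquired, so the lock cannot order it after the write.
	go func() {
		defer wg.Done()
		u.delete(qp.Pod)
	}()

	wg.Wait() // go run -race still reports the race despite the lock
}

The lock only orders map operations against each other; it cannot synchronize them with accesses made to the QueuedPodInfo objects elsewhere in the scheduler.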

@macsko
Member

macsko commented Jun 23, 2025

@googs1025 I found a fix for this race. See #132451

@googs1025
Member Author

/close

There is a better way #132451

@k8s-ci-robot
Contributor

@googs1025: Closed this PR.

In response to this:

/close

There is a better way #132451

