Race condition / data race in scheduling nomination vs. activation
Description
The commit fixes a race condition between pod nomination and moving a pod to the active scheduling queue. Previously, a pod could be moved to the active queue before its nomination was recorded, creating a window where the scheduler could pop the pod from the queue without an associated nomination, leading to timing-based misbehavior or inconsistencies in scheduling decisions. The patch ensures nomination is recorded before moving to the active queue by:
- Recording nomination in Add() before moveToActiveQ()
- Recording nomination in Update() before moveToActiveQ()
- Documenting that queue signaling is the caller’s responsibility (no implicit wakeup signals inside add/moveToActiveQ)
- Adding tests to verify that gated pods are nominated and that nomination happens prior to activation
Proof of Concept
PoC outline to illustrate the race before the fix. This demonstrates a scenario where a pod could be moved to the active scheduling queue prior to its nomination being recorded, enabling a timing-based race that could affect scheduling decisions.
Prereqs:
- A Kubernetes scheduler-like environment with the relevant queue structures (PriorityQueue, activeQueue, nominator).
- A pod object with nomination requirements (e.g., a pod that will be nominated to a node when allowed).
- Access to run concurrent operations on the queue (to reproduce the race).
Steps (conceptual Go-like pseudocode):
1) Construct the PriorityQueue and its components (activeQ, nominator) with a test logger.
2) Create a podInfo for pod-1 and an associated Pod object.
3) Launch two goroutines concurrently:
- G1: call q.moveToActiveQ(logger, pInfo, framework.EventUnscheduledPodAdd.Label(), false)
- G2: after a brief sleep, call q.nominator.addNominatedPod(logger, pInfo.PodInfo, nil)
4) Wait for both to complete and observe the state:
- Without the fix (pre-patch behavior), G1 could advance the pod into activeQ before G2 records the nomination, causing the nominator to lack an entry for that pod.
- With the fix (post-patch behavior), G2 records the nomination before the pod is moved to activeQ, ensuring nomination is present when the pod is active.
Expected outcome after the patch:
- The nomination is always recorded before the pod enters the active queue, preventing the race window and ensuring consistent scheduling decisions.
Notes:
- The PoC is conceptual because it depends on internal scheduling queue synchronization primitives (locks, cond vars) and the nominator’s internal state. A focused unit-test or integration test that exercises concurrent calls to Add/Update and nomination would reproduce the race without the patch and confirm its absence with the patch.
Commit Details
Author: Kubernetes Prow Robot
Date: 2026-05-14 14:14 UTC
Message:
Merge pull request #139057 from macsko/store_pod_nomination_before_adding_to_scheduling_queue
Store pod nomination before adding the pod to scheduling queue
Triage Assessment
Vulnerability Type: Race condition
Confidence: MEDIUM
Reasoning:
The patch ensures pod nomination is stored before moving the pod to the scheduling active queue, addressing a potential data race where a pod could be popped from the queue before its nomination is recorded. This reduces a race condition that could enable timing-based misbehavior or security-related inconsistencies in scheduling decisions.
Verification Assessment
Vulnerability Type: Race condition / data race in scheduling nomination vs. activation
Confidence: MEDIUM
Affected Versions: v1.36.0-beta.0 (and earlier in the 1.36.x line)
Code Diff
diff --git a/pkg/scheduler/backend/queue/active_queue.go b/pkg/scheduler/backend/queue/active_queue.go
index a47e1a36a2002..7d854da0c5bfa 100644
--- a/pkg/scheduler/backend/queue/active_queue.go
+++ b/pkg/scheduler/backend/queue/active_queue.go
@@ -44,6 +44,9 @@ type activeQueuer interface {
list() []*v1.Pod
len() int
has(pInfo *framework.QueuedPodInfo) bool
+ // add adds pInfo to the activeQ.
+ // Note: it does not signal the pop() method to wake up,
+ // so the caller is responsible for calling broadcast() after executing this method.
add(logger klog.Logger, pInfo *framework.QueuedPodInfo, event string)
movePodToInFlight(pInfo *framework.QueuedPodInfo) error
@@ -350,6 +353,9 @@ func (aq *activeQueue) has(pInfo *framework.QueuedPodInfo) bool {
return aq.queue.Has(pInfo)
}
+// add adds pInfo to the activeQ.
+// Note: it does not signal the pop() method to wake up,
+// so the caller is responsible for calling broadcast() after executing this method.
func (aq *activeQueue) add(logger klog.Logger, pInfo *framework.QueuedPodInfo, event string) {
aq.lock.Lock()
defer aq.lock.Unlock()
@@ -357,7 +363,6 @@ func (aq *activeQueue) add(logger klog.Logger, pInfo *framework.QueuedPodInfo, e
aq.queue.AddOrUpdate(pInfo)
metrics.SchedulerQueueIncomingPods.WithLabelValues("active", event).Inc()
logger.V(5).Info("Pod moved to an internal scheduling queue", "pod", klog.KObj(pInfo.Pod), "event", event, "queue", activeQ)
- aq.cond.Signal()
}
// listInFlightEvents returns all inFlightEvents.
diff --git a/pkg/scheduler/backend/queue/scheduling_queue.go b/pkg/scheduler/backend/queue/scheduling_queue.go
index 230a5142d86a0..6ddb804db274f 100644
--- a/pkg/scheduler/backend/queue/scheduling_queue.go
+++ b/pkg/scheduler/backend/queue/scheduling_queue.go
@@ -657,6 +657,8 @@ func (p *PriorityQueue) AddNominatedPod(logger klog.Logger, pi fwk.PodInfo, nomi
// movesFromBackoffQ should be set to true, if the pod directly moves from the backoffQ, so the PreEnqueue call can be skipped.
// It returns a boolean flag to indicate whether the pod is added successfully.
// Pod should be removed from the backoffQ before calling moveToActiveQ
+// Note: it does not signal the Pop() method to wake up,
+// so the caller is responsible for calling activeQ.broadcast() after executing this method.
func (p *PriorityQueue) moveToActiveQ(logger klog.Logger, pInfo *framework.QueuedPodInfo, event string, movesFromBackoffQ bool) bool {
gatedBefore := pInfo.Gated()
// If SchedulerPopFromBackoffQ feature gate is enabled,
@@ -683,9 +685,6 @@ func (p *PriorityQueue) moveToActiveQ(logger klog.Logger, pInfo *framework.Queue
p.unschedulablePods.delete(pInfo.Pod, gatedBefore)
p.activeQ.add(logger, pInfo, event)
- if event == framework.EventUnscheduledPodAdd.Label() || event == framework.EventUnscheduledPodUpdate.Label() {
- p.nominator.addNominatedPod(logger, pInfo.PodInfo, nil)
- }
// Pod successfully moved to activeQ.
return true
}
@@ -722,6 +721,11 @@ func (p *PriorityQueue) Add(ctx context.Context, pod *v1.Pod) {
pInfo := p.newQueuedPodInfo(ctx, pod)
logger := klog.FromContext(ctx)
+ // addNominatedPod is called here unconditionally to ensure that the nomination of the added pod
+ // (even if gated, and thus not entering activeQ) is properly recorded in the nominator.
+ // Furthermore, this must be called before moveToActiveQ to prevent a potential data race,
+ // where an active scheduler loop could pop and process the pod before its nomination is recorded.
+ p.nominator.addNominatedPod(logger, pInfo.PodInfo, nil)
if added := p.moveToActiveQ(logger, pInfo, framework.EventUnscheduledPodAdd.Label(), false); added {
p.activeQ.broadcast()
}
@@ -1052,6 +1056,8 @@ func (p *PriorityQueue) Update(ctx context.Context, oldPod, newPod *v1.Pod) {
}
// If pod is not in any of the queues, we put it in the active queue.
pInfo := p.newQueuedPodInfo(ctx, newPod)
+ // addNominatedPod must be called before moveToActiveQ for the same reason as in the Add() method.
+ p.nominator.addNominatedPod(logger, pInfo.PodInfo, nil)
if added := p.moveToActiveQ(logger, pInfo, framework.EventUnscheduledPodUpdate.Label(), false); added {
p.activeQ.broadcast()
}
diff --git a/pkg/scheduler/backend/queue/scheduling_queue_test.go b/pkg/scheduler/backend/queue/scheduling_queue_test.go
index 96c4c53dc593a..54555dba59912 100644
--- a/pkg/scheduler/backend/queue/scheduling_queue_test.go
+++ b/pkg/scheduler/backend/queue/scheduling_queue_test.go
@@ -172,6 +172,37 @@ func TestPriorityQueue_Add(t *testing.T) {
}
}
+func TestPriorityQueue_AddNominatedGatedPod(t *testing.T) {
+ gatedPod := st.MakePod().Name("pod-gated").Namespace("ns1").UID("pod-gated").NominatedNodeName("node1").Obj()
+ objs := []runtime.Object{gatedPod}
+ _, ctx := ktesting.NewTestContext(t)
+ ctx, cancel := context.WithCancel(ctx)
+ defer cancel()
+
+ plugin := &preEnqueuePlugin{allowlists: []string{"allow"}}
+ m := map[string]map[string]fwk.PreEnqueuePlugin{
+ "": {
+ "preEnqueuePlugin": plugin,
+ },
+ }
+ q := NewTestQueueWithObjects(ctx, newDefaultQueueSort(), objs, WithPreEnqueuePluginMap(m))
+ q.Add(ctx, gatedPod)
+
+ // Verify the pod is gated
+ pInfo := q.unschedulablePods.get(gatedPod)
+ if pInfo == nil || !pInfo.Gated() {
+ t.Fatalf("Expected pod to be gated in unschedulablePods")
+ }
+
+ // Verify the pod is added to nominator
+ if len(q.nominator.nominatedPods["node1"]) != 1 {
+ t.Errorf("Expected pod-gated in nominatedPods")
+ }
+ if q.nominator.nominatedPodToNode[gatedPod.UID] != "node1" {
+ t.Errorf("Expected pod-gated in nominatedPodToNode")
+ }
+}
+
func newDefaultQueueSort() fwk.LessFunc {
sort := &queuesort.PrioritySort{}
return sort.Less
@@ -1199,6 +1230,12 @@ func TestPriorityQueue_Update(t *testing.T) {
},
}
+ withGate := func(p *v1.Pod) *v1.Pod {
+ newPod := p.DeepCopy()
+ newPod.Labels = map[string]string{"deny": "true"}
+ return newPod
+ }
+
notInAnyQueue := "NotInAnyQueue"
tests := []struct {
name string
@@ -1220,13 +1257,30 @@ func TestPriorityQueue_Update(t *testing.T) {
},
},
{
- name: "Update highPriorityPodInfo and add a nominatedNodeName to it",
+ name: "Update gated pod that didn't exist in the queue",
+ wantQ: unschedulableQ,
+ prepareFunc: func(tCtx ktesting.TContext, q *PriorityQueue) (oldPod, newPod *v1.Pod) {
+ updatedPod := withGate(medPriorityPodInfo.Pod)
+ updatedPod.Annotations["foo"] = "test"
+ return withGate(medPriorityPodInfo.Pod), updatedPod
+ },
+ },
+ {
+ name: "Update non-existent highPriorityPodInfo and add a nominatedNodeName to it",
wantQ: activeQ,
wantAddedToNominated: true,
prepareFunc: func(tCtx ktesting.TContext, q *PriorityQueue) (oldPod, newPod *v1.Pod) {
return highPriorityPodInfo.Pod, highPriNominatedPodInfo.Pod
},
},
+ {
+ name: "Update non-existent gated highPriorityPodInfo and add a nominatedNodeName to it",
+ wantQ: unschedulableQ,
+ wantAddedToNominated: true,
+ prepareFunc: func(tCtx ktesting.TContext, q *PriorityQueue) (oldPod, newPod *v1.Pod) {
+ return withGate(highPriorityPodInfo.Pod), withGate(highPriNominatedPodInfo.Pod)
+ },
+ },
{
name: "When updating a pod that is already in activeQ, the pod should remain in activeQ after Update()",
wantQ: activeQ,
@@ -1305,7 +1359,13 @@ func TestPriorityQueue_Update(t *testing.T) {
t.Run(tt.name, func(t *testing.T) {
tCtx := ktesting.Init(t)
objs := []runtime.Object{highPriorityPodInfo.Pod, unschedulablePodInfo.Pod, medPriorityPodInfo.Pod}
- q := NewTestQueueWithObjects(tCtx, newDefaultQueueSort(), objs, WithClock(c), WithQueueingHintMapPerProfile(queueingHintMap))
+ plugin := &denyingPreEnqueuePlugin{denylists: []string{"deny"}}
+ m := map[string]map[string]fwk.PreEnqueuePlugin{
+ "": {
+ "denyingPreEnqueuePlugin": plugin,
+ },
+ }
+ q := NewTestQueueWithObjects(tCtx, newDefaultQueueSort(), objs, WithClock(c), WithQueueingHintMapPerProfile(queueingHintMap), WithPreEnqueuePluginMap(m))
oldPod, newPod := tt.prepareFunc(tCtx, q)
@@ -1347,7 +1407,7 @@ func TestPriorityQueue_Update(t *testing.T) {
}
if tt.wantAddedToNominated && len(q.nominator.nominatedPods) != 1 {
- t.Errorf("Expected one item in nominatedPods map: %v", q.nominator)
+ t.Errorf("Expected one item in nominatedPods map: %v", q.nominator.nominatedPods)
}
})
@@ -1709,7 +1769,26 @@ func (pl *preEnqueuePlugin) PreEnqueue(ctx context.Context, p *v1.Pod) *fwk.Stat
}
}
}
- return fwk.NewStatus(fwk.UnschedulableAndUnresolvable, "pod name not in allowlists")
+ return fwk.NewStatus(fwk.UnschedulableAndUnresolvable, "pod label not in allowlists")
+}
+
+type denyingPreEnqueuePlugin struct {
+ denylists []string
+}
+
+func (pl *denyingPreEnqueuePlugin) Name() string {
+ return "denyingPreEnqueuePlugin"
+}
+
+func (pl *denyingPreEnqueuePlugin) PreEnqueue(ctx context.Context, p *v1.Pod) *fwk.Status {
+ for _, denied := range pl.denylists {
+ for label := range p.Labels {
+ if label == denied {
+ return fwk.NewStatus(fwk.UnschedulableAndUnresolvable, "pod label in denylists")
+ }
+ }
+ }
+ return nil
}
func TestPriorityQueue_moveToActiveQ(t *testing.T) {