Race condition / data race in scheduling nomination vs. activation

MEDIUM
kubernetes/kubernetes
Commit: 6bca55305136
Affected: v1.36.0-beta.0 (and earlier in the 1.36.x line)
2026-05-26 19:37 UTC

Description

The commit fixes a race condition between pod nomination and moving a pod to the active scheduling queue. Previously, a pod could be moved to the active queue before its nomination was recorded, creating a window where the scheduler could pop the pod from the queue without an associated nomination, leading to timing-based misbehavior or inconsistencies in scheduling decisions. The patch ensures nomination is recorded before moving to the active queue by: - Recording nomination in Add() before moveToActiveQ() - Recording nomination in Update() before moveToActiveQ() - Documenting that queue signaling is the caller’s responsibility (no implicit wakeup signals inside add/moveToActiveQ) - Adding tests to verify that gated pods are nominated and that nomination happens prior to activation

Proof of Concept

PoC outline to illustrate the race before the fix. This demonstrates a scenario where a pod could be moved to the active scheduling queue prior to its nomination being recorded, enabling a timing-based race that could affect scheduling decisions. Prereqs: - A Kubernetes scheduler-like environment with the relevant queue structures (PriorityQueue, activeQueue, nominator). - A pod object with nomination requirements (e.g., a pod that will be nominated to a node when allowed). - Access to run concurrent operations on the queue (to reproduce the race). Steps (conceptual Go-like pseudocode): 1) Construct the PriorityQueue and its components (activeQ, nominator) with a test logger. 2) Create a podInfo for pod-1 and an associated Pod object. 3) Launch two goroutines concurrently: - G1: call q.moveToActiveQ(logger, pInfo, framework.EventUnscheduledPodAdd.Label(), false) - G2: after a brief sleep, call q.nominator.addNominatedPod(logger, pInfo.PodInfo, nil) 4) Wait for both to complete and observe the state: - Without the fix (pre-patch behavior), G1 could advance the pod into activeQ before G2 records the nomination, causing the nominator to lack an entry for that pod. - With the fix (post-patch behavior), G2 records the nomination before the pod is moved to activeQ, ensuring nomination is present when the pod is active. Expected outcome after the patch: - The nomination is always recorded before the pod enters the active queue, preventing the race window and ensuring consistent scheduling decisions. Notes: - The PoC is conceptual because it depends on internal scheduling queue synchronization primitives (locks, cond vars) and the nominator’s internal state. A focused unit-test or integration test that exercises concurrent calls to Add/Update and nomination would reproduce the race without the patch and confirm its absence with the patch.

Commit Details

Author: Kubernetes Prow Robot

Date: 2026-05-14 14:14 UTC

Message:

Merge pull request #139057 from macsko/store_pod_nomination_before_adding_to_scheduling_queue Store pod nomination before adding the pod to scheduling queue

Triage Assessment

Vulnerability Type: Race condition

Confidence: MEDIUM

Reasoning:

The patch ensures pod nomination is stored before moving the pod to the scheduling active queue, addressing a potential data race where a pod could be popped from the queue before its nomination is recorded. This reduces a race condition that could enable timing-based misbehavior or security-related inconsistencies in scheduling decisions.

Verification Assessment

Vulnerability Type: Race condition / data race in scheduling nomination vs. activation

Confidence: MEDIUM

Affected Versions: v1.36.0-beta.0 (and earlier in the 1.36.x line)

Code Diff

diff --git a/pkg/scheduler/backend/queue/active_queue.go b/pkg/scheduler/backend/queue/active_queue.go index a47e1a36a2002..7d854da0c5bfa 100644 --- a/pkg/scheduler/backend/queue/active_queue.go +++ b/pkg/scheduler/backend/queue/active_queue.go @@ -44,6 +44,9 @@ type activeQueuer interface { list() []*v1.Pod len() int has(pInfo *framework.QueuedPodInfo) bool + // add adds pInfo to the activeQ. + // Note: it does not signal the pop() method to wake up, + // so the caller is responsible for calling broadcast() after executing this method. add(logger klog.Logger, pInfo *framework.QueuedPodInfo, event string) movePodToInFlight(pInfo *framework.QueuedPodInfo) error @@ -350,6 +353,9 @@ func (aq *activeQueue) has(pInfo *framework.QueuedPodInfo) bool { return aq.queue.Has(pInfo) } +// add adds pInfo to the activeQ. +// Note: it does not signal the pop() method to wake up, +// so the caller is responsible for calling broadcast() after executing this method. func (aq *activeQueue) add(logger klog.Logger, pInfo *framework.QueuedPodInfo, event string) { aq.lock.Lock() defer aq.lock.Unlock() @@ -357,7 +363,6 @@ func (aq *activeQueue) add(logger klog.Logger, pInfo *framework.QueuedPodInfo, e aq.queue.AddOrUpdate(pInfo) metrics.SchedulerQueueIncomingPods.WithLabelValues("active", event).Inc() logger.V(5).Info("Pod moved to an internal scheduling queue", "pod", klog.KObj(pInfo.Pod), "event", event, "queue", activeQ) - aq.cond.Signal() } // listInFlightEvents returns all inFlightEvents. diff --git a/pkg/scheduler/backend/queue/scheduling_queue.go b/pkg/scheduler/backend/queue/scheduling_queue.go index 230a5142d86a0..6ddb804db274f 100644 --- a/pkg/scheduler/backend/queue/scheduling_queue.go +++ b/pkg/scheduler/backend/queue/scheduling_queue.go @@ -657,6 +657,8 @@ func (p *PriorityQueue) AddNominatedPod(logger klog.Logger, pi fwk.PodInfo, nomi // movesFromBackoffQ should be set to true, if the pod directly moves from the backoffQ, so the PreEnqueue call can be skipped. // It returns a boolean flag to indicate whether the pod is added successfully. // Pod should be removed from the backoffQ before calling moveToActiveQ +// Note: it does not signal the Pop() method to wake up, +// so the caller is responsible for calling activeQ.broadcast() after executing this method. func (p *PriorityQueue) moveToActiveQ(logger klog.Logger, pInfo *framework.QueuedPodInfo, event string, movesFromBackoffQ bool) bool { gatedBefore := pInfo.Gated() // If SchedulerPopFromBackoffQ feature gate is enabled, @@ -683,9 +685,6 @@ func (p *PriorityQueue) moveToActiveQ(logger klog.Logger, pInfo *framework.Queue p.unschedulablePods.delete(pInfo.Pod, gatedBefore) p.activeQ.add(logger, pInfo, event) - if event == framework.EventUnscheduledPodAdd.Label() || event == framework.EventUnscheduledPodUpdate.Label() { - p.nominator.addNominatedPod(logger, pInfo.PodInfo, nil) - } // Pod successfully moved to activeQ. return true } @@ -722,6 +721,11 @@ func (p *PriorityQueue) Add(ctx context.Context, pod *v1.Pod) { pInfo := p.newQueuedPodInfo(ctx, pod) logger := klog.FromContext(ctx) + // addNominatedPod is called here unconditionally to ensure that the nomination of the added pod + // (even if gated, and thus not entering activeQ) is properly recorded in the nominator. + // Furthermore, this must be called before moveToActiveQ to prevent a potential data race, + // where an active scheduler loop could pop and process the pod before its nomination is recorded. + p.nominator.addNominatedPod(logger, pInfo.PodInfo, nil) if added := p.moveToActiveQ(logger, pInfo, framework.EventUnscheduledPodAdd.Label(), false); added { p.activeQ.broadcast() } @@ -1052,6 +1056,8 @@ func (p *PriorityQueue) Update(ctx context.Context, oldPod, newPod *v1.Pod) { } // If pod is not in any of the queues, we put it in the active queue. pInfo := p.newQueuedPodInfo(ctx, newPod) + // addNominatedPod must be called before moveToActiveQ for the same reason as in the Add() method. + p.nominator.addNominatedPod(logger, pInfo.PodInfo, nil) if added := p.moveToActiveQ(logger, pInfo, framework.EventUnscheduledPodUpdate.Label(), false); added { p.activeQ.broadcast() } diff --git a/pkg/scheduler/backend/queue/scheduling_queue_test.go b/pkg/scheduler/backend/queue/scheduling_queue_test.go index 96c4c53dc593a..54555dba59912 100644 --- a/pkg/scheduler/backend/queue/scheduling_queue_test.go +++ b/pkg/scheduler/backend/queue/scheduling_queue_test.go @@ -172,6 +172,37 @@ func TestPriorityQueue_Add(t *testing.T) { } } +func TestPriorityQueue_AddNominatedGatedPod(t *testing.T) { + gatedPod := st.MakePod().Name("pod-gated").Namespace("ns1").UID("pod-gated").NominatedNodeName("node1").Obj() + objs := []runtime.Object{gatedPod} + _, ctx := ktesting.NewTestContext(t) + ctx, cancel := context.WithCancel(ctx) + defer cancel() + + plugin := &preEnqueuePlugin{allowlists: []string{"allow"}} + m := map[string]map[string]fwk.PreEnqueuePlugin{ + "": { + "preEnqueuePlugin": plugin, + }, + } + q := NewTestQueueWithObjects(ctx, newDefaultQueueSort(), objs, WithPreEnqueuePluginMap(m)) + q.Add(ctx, gatedPod) + + // Verify the pod is gated + pInfo := q.unschedulablePods.get(gatedPod) + if pInfo == nil || !pInfo.Gated() { + t.Fatalf("Expected pod to be gated in unschedulablePods") + } + + // Verify the pod is added to nominator + if len(q.nominator.nominatedPods["node1"]) != 1 { + t.Errorf("Expected pod-gated in nominatedPods") + } + if q.nominator.nominatedPodToNode[gatedPod.UID] != "node1" { + t.Errorf("Expected pod-gated in nominatedPodToNode") + } +} + func newDefaultQueueSort() fwk.LessFunc { sort := &queuesort.PrioritySort{} return sort.Less @@ -1199,6 +1230,12 @@ func TestPriorityQueue_Update(t *testing.T) { }, } + withGate := func(p *v1.Pod) *v1.Pod { + newPod := p.DeepCopy() + newPod.Labels = map[string]string{"deny": "true"} + return newPod + } + notInAnyQueue := "NotInAnyQueue" tests := []struct { name string @@ -1220,13 +1257,30 @@ func TestPriorityQueue_Update(t *testing.T) { }, }, { - name: "Update highPriorityPodInfo and add a nominatedNodeName to it", + name: "Update gated pod that didn't exist in the queue", + wantQ: unschedulableQ, + prepareFunc: func(tCtx ktesting.TContext, q *PriorityQueue) (oldPod, newPod *v1.Pod) { + updatedPod := withGate(medPriorityPodInfo.Pod) + updatedPod.Annotations["foo"] = "test" + return withGate(medPriorityPodInfo.Pod), updatedPod + }, + }, + { + name: "Update non-existent highPriorityPodInfo and add a nominatedNodeName to it", wantQ: activeQ, wantAddedToNominated: true, prepareFunc: func(tCtx ktesting.TContext, q *PriorityQueue) (oldPod, newPod *v1.Pod) { return highPriorityPodInfo.Pod, highPriNominatedPodInfo.Pod }, }, + { + name: "Update non-existent gated highPriorityPodInfo and add a nominatedNodeName to it", + wantQ: unschedulableQ, + wantAddedToNominated: true, + prepareFunc: func(tCtx ktesting.TContext, q *PriorityQueue) (oldPod, newPod *v1.Pod) { + return withGate(highPriorityPodInfo.Pod), withGate(highPriNominatedPodInfo.Pod) + }, + }, { name: "When updating a pod that is already in activeQ, the pod should remain in activeQ after Update()", wantQ: activeQ, @@ -1305,7 +1359,13 @@ func TestPriorityQueue_Update(t *testing.T) { t.Run(tt.name, func(t *testing.T) { tCtx := ktesting.Init(t) objs := []runtime.Object{highPriorityPodInfo.Pod, unschedulablePodInfo.Pod, medPriorityPodInfo.Pod} - q := NewTestQueueWithObjects(tCtx, newDefaultQueueSort(), objs, WithClock(c), WithQueueingHintMapPerProfile(queueingHintMap)) + plugin := &denyingPreEnqueuePlugin{denylists: []string{"deny"}} + m := map[string]map[string]fwk.PreEnqueuePlugin{ + "": { + "denyingPreEnqueuePlugin": plugin, + }, + } + q := NewTestQueueWithObjects(tCtx, newDefaultQueueSort(), objs, WithClock(c), WithQueueingHintMapPerProfile(queueingHintMap), WithPreEnqueuePluginMap(m)) oldPod, newPod := tt.prepareFunc(tCtx, q) @@ -1347,7 +1407,7 @@ func TestPriorityQueue_Update(t *testing.T) { } if tt.wantAddedToNominated && len(q.nominator.nominatedPods) != 1 { - t.Errorf("Expected one item in nominatedPods map: %v", q.nominator) + t.Errorf("Expected one item in nominatedPods map: %v", q.nominator.nominatedPods) } }) @@ -1709,7 +1769,26 @@ func (pl *preEnqueuePlugin) PreEnqueue(ctx context.Context, p *v1.Pod) *fwk.Stat } } } - return fwk.NewStatus(fwk.UnschedulableAndUnresolvable, "pod name not in allowlists") + return fwk.NewStatus(fwk.UnschedulableAndUnresolvable, "pod label not in allowlists") +} + +type denyingPreEnqueuePlugin struct { + denylists []string +} + +func (pl *denyingPreEnqueuePlugin) Name() string { + return "denyingPreEnqueuePlugin" +} + +func (pl *denyingPreEnqueuePlugin) PreEnqueue(ctx context.Context, p *v1.Pod) *fwk.Status { + for _, denied := range pl.denylists { + for label := range p.Labels { + if label == denied { + return fwk.NewStatus(fwk.UnschedulableAndUnresolvable, "pod label in denylists") + } + } + } + return nil } func TestPriorityQueue_moveToActiveQ(t *testing.T) {
← Back to Alerts View on GitHub →