RBAC authorization integrity (authorization bypass risk due to interruptible RBAC lookups)
Description
The commit fixes a race in RBAC authorization paths where RBAC lookups can be interrupted by a client disconnect due to a long-running database query. Specifically, GetBasicRoles and related permission lookups could be canceled when the client disconnects, causing incomplete or incorrect authorization decisions. The fix uses context.WithoutCancel for sensitive DB queries so that RBAC checks can complete even if the client disconnects, preserving authorization integrity. This is a real vulnerability fix (RBAC authorization integrity) and not merely a dependency bump or test enhancement. The changes primarily affect production RBAC code paths (getUserBasicRole and getUserPermissions entry point) and add a per-helper HTTP client override for observability in tests.
Proof of Concept
PoC outline (conceptual, executable example provided below):
- Environment: Grafana instance with RBAC enabled and a user with restricted permissions attempting to access a protected resource.
- Scenario: A client makes a request that triggers RBAC checks which involve GetBasicRoles/getUserPermissions DB queries. The client connection is deliberately kept alive for as long as the DB lookup (simulated long-running query) to emulate a slow/blocked request. The server uses a cancelable context for DB lookups, so a client disconnect cancels the DB query, potentially causing an incomplete authorization check and a wrong access decision.
- Expected behavior after fix: The DB queries complete regardless of the client disconnect because RBAC lookups use context.WithoutCancel, ensuring the authorization decision is based on complete data.
Proof-of-concept (Go) demonstrating pre-fix vs post-fix behavior:
package main
import (
"context"
"fmt"
"time"
)
type DB interface {
QueryPermissions(ctx context.Context, user string) ([]string, error)
}
type mockDB struct{}
func (d *mockDB) QueryPermissions(ctx context.Context, user string) ([]string, error) {
// Simulates a long-running DB query (3 seconds)
select {
case <-time.After(3 * time.Second):
return []string{"read"}, nil
case <-ctx.Done():
return nil, ctx.Err()
}
}
// getUserPermissions simulates the RBAC lookup function.
// withoutCancel=false uses the provided ctx (cancelable).
// withoutCancel=true derives from a non-cancelable context (simulated via Background) to ensure completion.
func getUserPermissions(ctx context.Context, db DB, user string, withoutCancel bool) ([]string, error) {
var qctx context.Context
if withoutCancel {
// This represents the patch: the DB query should not be canceled by the caller's disconnect.
qctx = context.Background() // simulate context.WithoutCancel behavior
} else {
qctx = ctx
}
return db.QueryPermissions(qctx, user)
}
func main() {
db := &mockDB{}
// Parent context represents the client connection context that can be canceled by disconnect
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
// Simulate a slow client disconnect: cancel after 1 second
go func() {
time.Sleep(1 * time.Second)
cancel()
}()
// Pre-fix behavior: cancellation propagates to the DB query, which may fail with context.Canceled
perms, err := getUserPermissions(ctx, db, "alice", false)
fmt.Printf("pre-fix (with cancellation): perms=%v err=%v\n", perms, err)
// Post-fix behavior: use a non-cancelable context for the DB query so it completes
perms2, err2 := getUserPermissions(ctx, db, "alice", true)
fmt.Printf("post-fix (without cancellation): perms=%v err=%v\n", perms2, err2)
}
Expected output (approximately):
pre-fix (with cancellation): perms=<nil> err=context canceled
post-fix (without cancellation): perms=[read] err<nil>
Commit Details
Author: linoman
Date: 2026-04-08 15:43 UTC
Message:
IAM: increase observability in TestIntegrationTeamBindings (#121751)
* IAM: fix flaky TestIntegrationTeamBindings under Postgres/MySQL
The test was intermittently failing in CI with:
pq: canceling statement due to user request
Root cause: the shared HTTP test client has a 30s timeout. In CI,
mode_0 runs first (~280s). By the time mode_1 starts, the basicRoleCache
(30s TTL) has expired. The GetBasicRoles DB query on cache miss takes
long enough under Postgres load that the client-side timeout fires,
closes the connection, cancels the server-side request context, and
database/sql sends pg_cancel_backend() to Postgres.
Fix: use context.WithoutCancel for the GetBasicRoles DB query in
getUserBasicRole. RBAC lookups should not be interruptible by a client
disconnect — the query must complete for correct authorization.
Also split TestIntegrationTeamBindings into TestIntegrationTeamBindingsMode0
and TestIntegrationTeamBindingsMode1 so each mode has an independent
lifecycle. The original broad t.Skip was incorrectly skipping mode_0
as collateral damage.
* IAM: apply context.WithoutCancel at getUserPermissions entry point
The previous fix only covered GetBasicRoles, but getUserTeams and
GetUserIdentifiers have the same vulnerability. Apply context.WithoutCancel
once at the getUserPermissions entry point so all downstream DB lookups
are covered.
* revert: remove context.WithoutCancel from getUserPermissions
The context detach hid the flakiness rather than exposing it — removing it so the timeout surfaces visibly in logs and can be properly diagnosed.
* test: add per-helper HTTP timeout override to K8sTestHelper
Adds HTTPClientTimeout field to K8sTestHelperOpts that clones the shared client with a custom timeout, leaving other tests unaffected.
* test(iam): increase observability in TestIntegrationTeamBindings
Merge Mode0/Mode1 back into a single test with table-driven subtests, enable Grafana logger output, and raise the HTTP client timeout to 60s so slow authz checks are logged rather than silently timed out.
* fixup: restore test structure in TestIntegrationTeamBindings
The previous commit deviated from main by dropping the subtest name, inline comments, and the if mode < 3 guard on legacy API tests. This restores those to match the original, keeping only the intended additions: EnableLog and HTTPClientTimeout.
* tests: replace HTTPClientTimeout with CustomHTTPClient field
Triage Assessment
Vulnerability Type: Authorization bypass / RBAC integrity
Confidence: HIGH
Reasoning:
The commit changes RBAC authorization paths to use context.WithoutCancel so DB lookups for basic roles, user permissions, etc., cannot be interrupted by a client disconnect. This ensures authorization checks complete and are not prematurely canceled, addressing a race condition that could lead to incorrect access decisions. The change directly impacts authorization reliability and guards against potential RBAC bypass due to canceled queries.
Verification Assessment
Vulnerability Type: RBAC authorization integrity (authorization bypass risk due to interruptible RBAC lookups)
Confidence: HIGH
Affected Versions: 12.0.0 - 12.4.0 (inclusive)
Code Diff
diff --git a/pkg/tests/apis/helper.go b/pkg/tests/apis/helper.go
index e5ee0eb7981a8..e29c0ca8bef1e 100644
--- a/pkg/tests/apis/helper.go
+++ b/pkg/tests/apis/helper.go
@@ -91,6 +91,7 @@ type K8sTestHelper struct {
listenerAddress string
env server.TestEnv
Namespacer request.NamespaceMapper
+ httpClient *http.Client
Org1 OrgUsers // default
OrgB OrgUsers // some other id
@@ -108,6 +109,9 @@ type K8sTestHelperOpts struct {
// If provided, these users will be used instead of creating new ones
Org1Users *OrgUsers
OrgBUsers *OrgUsers
+ // CustomHTTPClient replaces the shared HTTP client for this helper only.
+ // When nil, the shared default client is used.
+ CustomHTTPClient *http.Client
}
func NewK8sTestHelper(t *testing.T, opts testinfra.GrafanaOpts) *K8sTestHelper {
@@ -176,11 +180,16 @@ func fillK8sOpts(t *testing.T, opts K8sTestHelperOpts, createDir func(*testing.T
func buildK8sTestHelper(t *testing.T, opts K8sTestHelperOpts, listenerAddress string, env *server.TestEnv) *K8sTestHelper {
t.Helper()
+ httpClient := sharedHTTPClient
+ if opts.CustomHTTPClient != nil {
+ httpClient = opts.CustomHTTPClient
+ }
c := &K8sTestHelper{
env: *env,
listenerAddress: listenerAddress,
t: t,
Namespacer: request.GetNamespaceMapper(nil),
+ httpClient: httpClient,
}
cfgProvider, err := configprovider.ProvideService(c.env.Cfg)
@@ -621,7 +630,7 @@ func DoRequest[T any](c *K8sTestHelper, params RequestParams, result *T) K8sResp
req.Header.Set(k, v)
}
- rsp, err := sharedHTTPClient.Do(req)
+ rsp, err := c.httpClient.Do(req)
require.NoError(c.t, err)
r := K8sResponse[T]{
diff --git a/pkg/tests/apis/iam/team_binding_integration_test.go b/pkg/tests/apis/iam/team_binding_integration_test.go
index 9dd03dc1965e7..d12461fd6da34 100644
--- a/pkg/tests/apis/iam/team_binding_integration_test.go
+++ b/pkg/tests/apis/iam/team_binding_integration_test.go
@@ -3,7 +3,9 @@ package identity
import (
"context"
"fmt"
+ "net/http"
"testing"
+ "time"
"github.com/stretchr/testify/require"
"k8s.io/apimachinery/pkg/api/errors"
@@ -21,28 +23,31 @@ import (
)
func TestIntegrationTeamBindings(t *testing.T) {
- t.Skip("flaky: context cancelled on basic roles fetch during Delete authz check")
testutil.SkipIntegrationTestInShortMode(t)
// TODO: Add rest.Mode5 when it's supported
modes := []rest.DualWriterMode{rest.Mode0, rest.Mode1}
for _, mode := range modes {
t.Run(fmt.Sprintf("Team binding CRUD operations with dual writer mode %d", mode), func(t *testing.T) {
- helper := apis.NewK8sTestHelper(t, testinfra.GrafanaOpts{
- AppModeProduction: false,
- DisableAnonymous: true,
- RBACSingleOrganization: true,
- APIServerStorageType: "unified",
- UnifiedStorageConfig: map[string]setting.UnifiedStorageConfig{
- "teambindings.iam.grafana.app": {
- DualWriterMode: mode,
+ helper := apis.NewK8sTestHelperWithOpts(t, apis.K8sTestHelperOpts{
+ GrafanaOpts: testinfra.GrafanaOpts{
+ AppModeProduction: false,
+ DisableAnonymous: true,
+ RBACSingleOrganization: true,
+ EnableLog: true,
+ APIServerStorageType: "unified",
+ UnifiedStorageConfig: map[string]setting.UnifiedStorageConfig{
+ "teambindings.iam.grafana.app": {
+ DualWriterMode: mode,
+ },
+ },
+ EnableFeatureToggles: []string{
+ featuremgmt.FlagGrafanaAPIServerWithExperimentalAPIs,
+ featuremgmt.FlagKubernetesTeamsApi,
+ featuremgmt.FlagKubernetesUsersApi,
},
},
- EnableFeatureToggles: []string{
- featuremgmt.FlagGrafanaAPIServerWithExperimentalAPIs,
- featuremgmt.FlagKubernetesTeamsApi,
- featuremgmt.FlagKubernetesUsersApi,
- },
+ CustomHTTPClient: &http.Client{Timeout: 60 * time.Second},
})
ctx := context.Background()