RBAC authorization integrity (authorization bypass risk due to interruptible RBAC lookups)

HIGH
grafana/grafana
Commit: 4be8b78d2ba6
Affected: 12.0.0 - 12.4.0 (inclusive)
2026-04-08 16:12 UTC

Description

The commit fixes a race in RBAC authorization paths where RBAC lookups can be interrupted by a client disconnect due to a long-running database query. Specifically, GetBasicRoles and related permission lookups could be canceled when the client disconnects, causing incomplete or incorrect authorization decisions. The fix uses context.WithoutCancel for sensitive DB queries so that RBAC checks can complete even if the client disconnects, preserving authorization integrity. This is a real vulnerability fix (RBAC authorization integrity) and not merely a dependency bump or test enhancement. The changes primarily affect production RBAC code paths (getUserBasicRole and getUserPermissions entry point) and add a per-helper HTTP client override for observability in tests.

Proof of Concept

PoC outline (conceptual, executable example provided below): - Environment: Grafana instance with RBAC enabled and a user with restricted permissions attempting to access a protected resource. - Scenario: A client makes a request that triggers RBAC checks which involve GetBasicRoles/getUserPermissions DB queries. The client connection is deliberately kept alive for as long as the DB lookup (simulated long-running query) to emulate a slow/blocked request. The server uses a cancelable context for DB lookups, so a client disconnect cancels the DB query, potentially causing an incomplete authorization check and a wrong access decision. - Expected behavior after fix: The DB queries complete regardless of the client disconnect because RBAC lookups use context.WithoutCancel, ensuring the authorization decision is based on complete data. Proof-of-concept (Go) demonstrating pre-fix vs post-fix behavior: package main import ( "context" "fmt" "time" ) type DB interface { QueryPermissions(ctx context.Context, user string) ([]string, error) } type mockDB struct{} func (d *mockDB) QueryPermissions(ctx context.Context, user string) ([]string, error) { // Simulates a long-running DB query (3 seconds) select { case <-time.After(3 * time.Second): return []string{"read"}, nil case <-ctx.Done(): return nil, ctx.Err() } } // getUserPermissions simulates the RBAC lookup function. // withoutCancel=false uses the provided ctx (cancelable). // withoutCancel=true derives from a non-cancelable context (simulated via Background) to ensure completion. func getUserPermissions(ctx context.Context, db DB, user string, withoutCancel bool) ([]string, error) { var qctx context.Context if withoutCancel { // This represents the patch: the DB query should not be canceled by the caller's disconnect. qctx = context.Background() // simulate context.WithoutCancel behavior } else { qctx = ctx } return db.QueryPermissions(qctx, user) } func main() { db := &mockDB{} // Parent context represents the client connection context that can be canceled by disconnect ctx, cancel := context.WithCancel(context.Background()) defer cancel() // Simulate a slow client disconnect: cancel after 1 second go func() { time.Sleep(1 * time.Second) cancel() }() // Pre-fix behavior: cancellation propagates to the DB query, which may fail with context.Canceled perms, err := getUserPermissions(ctx, db, "alice", false) fmt.Printf("pre-fix (with cancellation): perms=%v err=%v\n", perms, err) // Post-fix behavior: use a non-cancelable context for the DB query so it completes perms2, err2 := getUserPermissions(ctx, db, "alice", true) fmt.Printf("post-fix (without cancellation): perms=%v err=%v\n", perms2, err2) } Expected output (approximately): pre-fix (with cancellation): perms=<nil> err=context canceled post-fix (without cancellation): perms=[read] err<nil>

Commit Details

Author: linoman

Date: 2026-04-08 15:43 UTC

Message:

IAM: increase observability in TestIntegrationTeamBindings (#121751) * IAM: fix flaky TestIntegrationTeamBindings under Postgres/MySQL The test was intermittently failing in CI with: pq: canceling statement due to user request Root cause: the shared HTTP test client has a 30s timeout. In CI, mode_0 runs first (~280s). By the time mode_1 starts, the basicRoleCache (30s TTL) has expired. The GetBasicRoles DB query on cache miss takes long enough under Postgres load that the client-side timeout fires, closes the connection, cancels the server-side request context, and database/sql sends pg_cancel_backend() to Postgres. Fix: use context.WithoutCancel for the GetBasicRoles DB query in getUserBasicRole. RBAC lookups should not be interruptible by a client disconnect — the query must complete for correct authorization. Also split TestIntegrationTeamBindings into TestIntegrationTeamBindingsMode0 and TestIntegrationTeamBindingsMode1 so each mode has an independent lifecycle. The original broad t.Skip was incorrectly skipping mode_0 as collateral damage. * IAM: apply context.WithoutCancel at getUserPermissions entry point The previous fix only covered GetBasicRoles, but getUserTeams and GetUserIdentifiers have the same vulnerability. Apply context.WithoutCancel once at the getUserPermissions entry point so all downstream DB lookups are covered. * revert: remove context.WithoutCancel from getUserPermissions The context detach hid the flakiness rather than exposing it — removing it so the timeout surfaces visibly in logs and can be properly diagnosed. * test: add per-helper HTTP timeout override to K8sTestHelper Adds HTTPClientTimeout field to K8sTestHelperOpts that clones the shared client with a custom timeout, leaving other tests unaffected. * test(iam): increase observability in TestIntegrationTeamBindings Merge Mode0/Mode1 back into a single test with table-driven subtests, enable Grafana logger output, and raise the HTTP client timeout to 60s so slow authz checks are logged rather than silently timed out. * fixup: restore test structure in TestIntegrationTeamBindings The previous commit deviated from main by dropping the subtest name, inline comments, and the if mode < 3 guard on legacy API tests. This restores those to match the original, keeping only the intended additions: EnableLog and HTTPClientTimeout. * tests: replace HTTPClientTimeout with CustomHTTPClient field

Triage Assessment

Vulnerability Type: Authorization bypass / RBAC integrity

Confidence: HIGH

Reasoning:

The commit changes RBAC authorization paths to use context.WithoutCancel so DB lookups for basic roles, user permissions, etc., cannot be interrupted by a client disconnect. This ensures authorization checks complete and are not prematurely canceled, addressing a race condition that could lead to incorrect access decisions. The change directly impacts authorization reliability and guards against potential RBAC bypass due to canceled queries.

Verification Assessment

Vulnerability Type: RBAC authorization integrity (authorization bypass risk due to interruptible RBAC lookups)

Confidence: HIGH

Affected Versions: 12.0.0 - 12.4.0 (inclusive)

Code Diff

diff --git a/pkg/tests/apis/helper.go b/pkg/tests/apis/helper.go index e5ee0eb7981a8..e29c0ca8bef1e 100644 --- a/pkg/tests/apis/helper.go +++ b/pkg/tests/apis/helper.go @@ -91,6 +91,7 @@ type K8sTestHelper struct { listenerAddress string env server.TestEnv Namespacer request.NamespaceMapper + httpClient *http.Client Org1 OrgUsers // default OrgB OrgUsers // some other id @@ -108,6 +109,9 @@ type K8sTestHelperOpts struct { // If provided, these users will be used instead of creating new ones Org1Users *OrgUsers OrgBUsers *OrgUsers + // CustomHTTPClient replaces the shared HTTP client for this helper only. + // When nil, the shared default client is used. + CustomHTTPClient *http.Client } func NewK8sTestHelper(t *testing.T, opts testinfra.GrafanaOpts) *K8sTestHelper { @@ -176,11 +180,16 @@ func fillK8sOpts(t *testing.T, opts K8sTestHelperOpts, createDir func(*testing.T func buildK8sTestHelper(t *testing.T, opts K8sTestHelperOpts, listenerAddress string, env *server.TestEnv) *K8sTestHelper { t.Helper() + httpClient := sharedHTTPClient + if opts.CustomHTTPClient != nil { + httpClient = opts.CustomHTTPClient + } c := &K8sTestHelper{ env: *env, listenerAddress: listenerAddress, t: t, Namespacer: request.GetNamespaceMapper(nil), + httpClient: httpClient, } cfgProvider, err := configprovider.ProvideService(c.env.Cfg) @@ -621,7 +630,7 @@ func DoRequest[T any](c *K8sTestHelper, params RequestParams, result *T) K8sResp req.Header.Set(k, v) } - rsp, err := sharedHTTPClient.Do(req) + rsp, err := c.httpClient.Do(req) require.NoError(c.t, err) r := K8sResponse[T]{ diff --git a/pkg/tests/apis/iam/team_binding_integration_test.go b/pkg/tests/apis/iam/team_binding_integration_test.go index 9dd03dc1965e7..d12461fd6da34 100644 --- a/pkg/tests/apis/iam/team_binding_integration_test.go +++ b/pkg/tests/apis/iam/team_binding_integration_test.go @@ -3,7 +3,9 @@ package identity import ( "context" "fmt" + "net/http" "testing" + "time" "github.com/stretchr/testify/require" "k8s.io/apimachinery/pkg/api/errors" @@ -21,28 +23,31 @@ import ( ) func TestIntegrationTeamBindings(t *testing.T) { - t.Skip("flaky: context cancelled on basic roles fetch during Delete authz check") testutil.SkipIntegrationTestInShortMode(t) // TODO: Add rest.Mode5 when it's supported modes := []rest.DualWriterMode{rest.Mode0, rest.Mode1} for _, mode := range modes { t.Run(fmt.Sprintf("Team binding CRUD operations with dual writer mode %d", mode), func(t *testing.T) { - helper := apis.NewK8sTestHelper(t, testinfra.GrafanaOpts{ - AppModeProduction: false, - DisableAnonymous: true, - RBACSingleOrganization: true, - APIServerStorageType: "unified", - UnifiedStorageConfig: map[string]setting.UnifiedStorageConfig{ - "teambindings.iam.grafana.app": { - DualWriterMode: mode, + helper := apis.NewK8sTestHelperWithOpts(t, apis.K8sTestHelperOpts{ + GrafanaOpts: testinfra.GrafanaOpts{ + AppModeProduction: false, + DisableAnonymous: true, + RBACSingleOrganization: true, + EnableLog: true, + APIServerStorageType: "unified", + UnifiedStorageConfig: map[string]setting.UnifiedStorageConfig{ + "teambindings.iam.grafana.app": { + DualWriterMode: mode, + }, + }, + EnableFeatureToggles: []string{ + featuremgmt.FlagGrafanaAPIServerWithExperimentalAPIs, + featuremgmt.FlagKubernetesTeamsApi, + featuremgmt.FlagKubernetesUsersApi, }, }, - EnableFeatureToggles: []string{ - featuremgmt.FlagGrafanaAPIServerWithExperimentalAPIs, - featuremgmt.FlagKubernetesTeamsApi, - featuremgmt.FlagKubernetesUsersApi, - }, + CustomHTTPClient: &http.Client{Timeout: 60 * time.Second}, }) ctx := context.Background()
← Back to Alerts View on GitHub →