Race Condition
Description
The commit adds a GCOM guard that prevents local tenant data deletion until the external GCOM stack is confirmed deleted. Prior to this change, local deletion could proceed without verifying the external GCOM state, creating a race in which data could be deleted locally while the corresponding external stack still existed, potentially leading to data inconsistency or exposure. The fix introduces a GCOM check in the TenantDeleter run loop and a helper, gcomAllowsTenantDeletion, that queries GCOM for the instance status and only allows deletion to proceed when GCOM reports the stack as deleted.
Proof of Concept
Proof of concept (pre-fix behavior):
- Prerequisites: Grafana with unified storage enabled; GCOM integration is not configured or returns a non-deleted status for the stack of a tenant being deleted.
- Create a tenant namespace (e.g. stacks-1) with resources (apps, dashboards, etc.).
- Create a PendingDeleteRecord for that tenant with a short DeleteAfter window.
- Do not configure GCOM or configure GCOM to return a non-deleted status for the tenant’s stack (e.g., Status = "active").
- Trigger a deletion pass (td.runDeletionPass).
- Observation: The tenant’s local data and the PendingDeleteRecord are removed before the external GCOM stack is confirmed deleted, leaving local state and GCOM out of sync. If the external stack is still active or only partially deprovisioned, this can leave data unavailable later or expose it through incomplete cleanup.
If the system were to use the fix (post-merge):
- The deletion pass would call td.gcomAllowsTenantDeletion(ctx, tenantName) and would skip local deletion unless GCOM reports Status = "deleted" for the stack (extracted from tenant namespace, e.g., stacks-<id>).
- Only after GCOM confirms deletion would the local data be removed.
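The stack-ID extraction mentioned above (tenant namespace of the form stacks-&lt;id&gt;) can be sketched as follows. Note that parseStackID is a hypothetical stand-in for authlib's types.ParseNamespace, shown only to illustrate the namespace convention, not the real API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseStackID extracts the numeric stack ID from a cloud tenant
// namespace of the form "stacks-<id>". Hypothetical stand-in for
// authlib's types.ParseNamespace, for illustration only.
func parseStackID(namespace string) (int64, error) {
	rest, ok := strings.CutPrefix(namespace, "stacks-")
	if !ok {
		return 0, fmt.Errorf("not a cloud stack namespace: %q", namespace)
	}
	id, err := strconv.ParseInt(rest, 10, 64)
	if err != nil || id <= 0 {
		return 0, fmt.Errorf("invalid stack id in namespace %q", namespace)
	}
	return id, nil
}

func main() {
	id, err := parseStackID("stacks-1")
	fmt.Println(id, err) // 1 <nil>
	_, err = parseStackID("tenant-1")
	fmt.Println(err != nil) // true: non-cloud namespaces are rejected
}
```

This mirrors why the fix skips deletion for namespaces that do not parse to a positive StackID: without a stack ID there is nothing to verify against GCOM.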
Simplified pseudo-code illustrating the race (pre-fix; ds, pds, and the helpers are stand-ins for the test fixtures):

package main

import (
	"context"
	"time"
)

func main() {
	ctx := context.Background()

	// Setup: GCOM is nil, or reports a non-deleted status for the stack.
	td := NewTenantDeleter(ds, pds, TenantDeleterConfig{Gcom: nil}) // or a Gcom returning Status != "deleted"

	// Seed tenant resources and an already-expired pending-delete record.
	saveTestResource("stacks-1", ...)
	pds.Upsert(ctx, "stacks-1", PendingDeleteRecord{DeleteAfter: time.Now().Add(-time.Minute)})

	// Run a deletion pass.
	td.runDeletionPass(ctx)

	// Expectation (vulnerability): local data and the pending-delete record
	// are removed regardless of the external GCOM stack state.
}
Post-fix behavior (desired):
- The call to td.gcomAllowsTenantDeletion(...) would query GCOM, see that Status != "deleted" (or error), and skip deletion until GCOM reports the stack as deleted.
- Data remains until GCOM confirms the stack is removed, preventing race-related inconsistencies.
Commit Details
Author: maicon
Date: 2026-04-06 14:39 UTC
Message:
Unified storage: confirm stack deleted in GCOM before tenant data removal (#121697)
* feat(unified-storage): verify GCOM before tenant data deletion
Made-with: Cursor
* don't delete tenant data if gcom is not configured
Signed-off-by: Maicon Costa <maiconscosta@gmail.com>
* log errors close to where they happen
Signed-off-by: Maicon Costa <maiconscosta@gmail.com>
---------
Signed-off-by: Maicon Costa <maiconscosta@gmail.com>
Triage Assessment
Vulnerability Type: Information Disclosure / Race Condition
Confidence: HIGH
Reasoning:
The commit adds a GCOM-based guard to prevent local tenant data deletion until the external GCOM stack is confirmed deleted. This reduces the risk of data being left behind or exposed when an external stack is still present, addressing potential information disclosure and race conditions between local deletion and external state. The changes include new logic to verify GCOM state before deletion and tests around this behavior.
Verification Assessment
Vulnerability Type: Race Condition
Confidence: HIGH
Affected Versions: < 12.4.0
Code Diff
diff --git a/pkg/services/gcom/gcom.go b/pkg/services/gcom/gcom.go
index 11a18ced5dc76..94de598998dbc 100644
--- a/pkg/services/gcom/gcom.go
+++ b/pkg/services/gcom/gcom.go
@@ -27,6 +27,7 @@ type Instance struct {
RegionSlug string `json:"regionSlug"`
ClusterSlug string `json:"clusterSlug"`
OrgId int `json:"orgId"`
+ Status string `json:"status"`
}
type Plugin struct {
diff --git a/pkg/storage/unified/resource/tenant_deleter.go b/pkg/storage/unified/resource/tenant_deleter.go
index 83f0181964ab3..0e30401d546c4 100644
--- a/pkg/storage/unified/resource/tenant_deleter.go
+++ b/pkg/storage/unified/resource/tenant_deleter.go
@@ -3,9 +3,13 @@ package resource
import (
"context"
"math/rand/v2"
+ "strconv"
"time"
+ "github.com/grafana/authlib/types"
"github.com/grafana/grafana/pkg/infra/log"
+ "github.com/grafana/grafana/pkg/infra/tracing"
+ "github.com/grafana/grafana/pkg/services/gcom"
"github.com/grafana/grafana/pkg/setting"
)
@@ -14,6 +18,9 @@ type TenantDeleterConfig struct {
DryRun bool
Interval time.Duration
Log log.Logger
+ // Gcom, when non-nil, is used to confirm the stack is removed in GCOM before local
+ // tenant data is deleted: GetInstanceByID returns Instance with Status "deleted".
+ Gcom gcom.Service
}
// NewTenantDeleterConfig creates a TenantDeleterConfig from Grafana settings and returns nil
@@ -42,6 +49,7 @@ type TenantDeleter struct {
pendingDeleteStore *PendingDeleteStore
dataStore *dataStore
cfg TenantDeleterConfig
+ gcom gcom.Service
stopCh chan struct{}
}
@@ -52,6 +60,7 @@ func NewTenantDeleter(ds *dataStore, pds *PendingDeleteStore, cfg TenantDeleterC
pendingDeleteStore: pds,
dataStore: ds,
cfg: cfg,
+ gcom: cfg.Gcom,
stopCh: make(chan struct{}),
}
}
@@ -126,7 +135,9 @@ func (td *TenantDeleter) runDeletionPass(ctx context.Context) {
continue
}
- // TODO add check to verify with GCOM that tenant is deleted
+ if !td.gcomAllowsTenantDeletion(ctx, tenantName) {
+ continue
+ }
if err := td.deleteTenant(ctx, tenantName, groupResources); err != nil {
td.log.Error("failed to delete tenant data", "tenant", tenantName, "error", err)
@@ -134,6 +145,38 @@ func (td *TenantDeleter) runDeletionPass(ctx context.Context) {
}
}
+// gcomAllowsTenantDeletion returns true when GCOM returns 200 with Status "deleted" for the given
+// tenant name. Otherwise it returns false and logs.
+func (td *TenantDeleter) gcomAllowsTenantDeletion(ctx context.Context, tenantName string) bool {
+ if td.gcom == nil {
+ return false
+ }
+
+ info, err := types.ParseNamespace(tenantName)
+ if err != nil || info.StackID <= 0 {
+ td.log.Error("GCOM verification requires a cloud stack namespace (stacks-{stackId}); skipping local data deletion",
+ "tenant", tenantName, "err", err, "stack_id", info.StackID)
+ return false
+ }
+ instanceID := strconv.FormatInt(info.StackID, 10)
+
+ reqID := tracing.TraceIDFromContext(ctx, false)
+ inst, err := td.gcom.GetInstanceByID(ctx, reqID, instanceID)
+ if err != nil {
+ td.log.Error("GCOM instance check failed; skipping local data deletion", "tenant", tenantName, "gcom_instance_id", instanceID, "err", err)
+ return false
+ }
+
+ if inst.Status != "deleted" {
+ td.log.Warn("stack still active in GCOM; skipping local data deletion",
+ "tenant", tenantName, "gcom_instance_id", instanceID, "gcom_status", inst.Status)
+ return false
+ }
+
+ // If we get here, the stack is deleted in GCOM.
+ return true
+}
+
// deleteTenant removes all resource data for the given tenant from the data
// store, then removes its pending-delete record.
func (td *TenantDeleter) deleteTenant(ctx context.Context, tenantName string, groupResources []GroupResource) error {
diff --git a/pkg/storage/unified/resource/tenant_deleter_test.go b/pkg/storage/unified/resource/tenant_deleter_test.go
index d028c4a16e204..932530c5c0213 100644
--- a/pkg/storage/unified/resource/tenant_deleter_test.go
+++ b/pkg/storage/unified/resource/tenant_deleter_test.go
@@ -10,10 +10,33 @@ import (
"time"
"github.com/grafana/grafana/pkg/infra/log"
+ "github.com/grafana/grafana/pkg/services/gcom"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
+// testGcomVerifier implements gcom.Service for tests. Default behavior: instance not in GCOM.
+type testGcomVerifier struct {
+ getInstance func(ctx context.Context, requestID, instanceID string) (gcom.Instance, error)
+}
+
+func (v *testGcomVerifier) GetInstanceByID(ctx context.Context, requestID, instanceID string) (gcom.Instance, error) {
+ if v.getInstance != nil {
+ return v.getInstance(ctx, requestID, instanceID)
+ }
+ return gcom.Instance{}, fmt.Errorf("instance not found")
+}
+
+func (v *testGcomVerifier) GetPlugins(ctx context.Context, requestID string) (map[string]gcom.Plugin, error) {
+ return map[string]gcom.Plugin{}, nil
+}
+
+// testStacksNS1 and testStacksNS2 are authlib-parseable cloud namespaces (StackID > 0).
+const (
+ testStacksNS1 = "stacks-1"
+ testStacksNS2 = "stacks-2"
+)
+
// failOnceBatchDeleteKV wraps a KV and makes BatchDelete fail on the Nth call,
// then succeed on all subsequent calls. This simulates a transient failure
// partway through a deletion pass.
@@ -63,6 +86,14 @@ func newTestTenantDeleter(t *testing.T, dryRun bool) (*TenantDeleter, *dataStore
DryRun: dryRun,
Interval: time.Hour,
Log: log.NewNopLogger(),
+ Gcom: &testGcomVerifier{
+ getInstance: func(ctx context.Context, requestID, instanceID string) (gcom.Instance, error) {
+ if instanceID == "1" {
+ return gcom.Instance{ID: 1, Slug: "test", Status: "deleted"}, nil
+ }
+ return gcom.Instance{}, fmt.Errorf("instance not found")
+ },
+ },
}
td := NewTenantDeleter(ds, pds, cfg)
return td, ds, pds
@@ -81,16 +112,16 @@ func futureTime() string {
func TestRunDeletionPass_SkipsNotExpired(t *testing.T) {
td, ds, pds := newTestTenantDeleter(t, false)
- saveTestResource(t, ds, "tenant-1", "apps", "dashboards", "dash1", 100, nil)
+ saveTestResource(t, ds, testStacksNS1, "apps", "dashboards", "dash1", 100, nil)
- require.NoError(t, pds.Upsert(t.Context(), "tenant-1", PendingDeleteRecord{
+ require.NoError(t, pds.Upsert(t.Context(), testStacksNS1, PendingDeleteRecord{
DeleteAfter: futureTime(),
}))
td.runDeletionPass(t.Context())
// Data should still be present.
- listKey := ListRequestKey{Group: "apps", Resource: "dashboards", Namespace: "tenant-1"}
+ listKey := ListRequestKey{Group: "apps", Resource: "dashboards", Namespace: testStacksNS1}
prefix := listKey.Prefix()
var count int
for _, err := range ds.kv.Keys(t.Context(), dataSection, ListOptions{
@@ -103,7 +134,7 @@ func TestRunDeletionPass_SkipsNotExpired(t *testing.T) {
assert.Equal(t, 1, count, "resource should still exist for non-expired tenant")
// Pending delete record should still be there.
- _, err := pds.Get(t.Context(), "tenant-1")
+ _, err := pds.Get(t.Context(), testStacksNS1)
assert.NoError(t, err)
}
@@ -112,17 +143,17 @@ func TestRunDeletionPass_SkipsNotExpired(t *testing.T) {
func TestRunDeletionPass_DeletesExpired(t *testing.T) {
td, ds, pds := newTestTenantDeleter(t, false)
- saveTestResource(t, ds, "tenant-1", "apps", "dashboards", "dash1", 100, nil)
- saveTestResource(t, ds, "tenant-1", "apps", "dashboards", "dash2", 101, nil)
+ saveTestResource(t, ds, testStacksNS1, "apps", "dashboards", "dash1", 100, nil)
+ saveTestResource(t, ds, testStacksNS1, "apps", "dashboards", "dash2", 101, nil)
- require.NoError(t, pds.Upsert(t.Context(), "tenant-1", PendingDeleteRecord{
+ require.NoError(t, pds.Upsert(t.Context(), testStacksNS1, PendingDeleteRecord{
DeleteAfter: pastTime(),
}))
td.runDeletionPass(t.Context())
- // All data keys for tenant-1 should be gone.
- listKey := ListRequestKey{Group: "apps", Resource: "dashboards", Namespace: "tenant-1"}
+ // All data keys for the expired tenant should be gone.
+ listKey := ListRequestKey{Group: "apps", Resource: "dashboards", Namespace: testStacksNS1}
prefix := listKey.Prefix()
var count int
for _, err := range ds.kv.Keys(t.Context(), dataSection, ListOptions{
@@ -135,7 +166,7 @@ func TestRunDeletionPass_DeletesExpired(t *testing.T) {
assert.Equal(t, 0, count, "resources should have been deleted for expired tenant")
// Pending delete record should be gone.
- _, err := pds.Get(t.Context(), "tenant-1")
+ _, err := pds.Get(t.Context(), testStacksNS1)
assert.ErrorIs(t, err, ErrNotFound)
}
@@ -144,21 +175,21 @@ func TestRunDeletionPass_DeletesExpired(t *testing.T) {
func TestRunDeletionPass_PreservesOtherTenants(t *testing.T) {
td, ds, pds := newTestTenantDeleter(t, false)
- // tenant-1 is expired; tenant-2 is not.
- saveTestResource(t, ds, "tenant-1", "apps", "dashboards", "dash1", 100, nil)
- saveTestResource(t, ds, "tenant-2", "apps", "dashboards", "dash2", 101, nil)
+ // First tenant is expired; second is not.
+ saveTestResource(t, ds, testStacksNS1, "apps", "dashboards", "dash1", 100, nil)
+ saveTestResource(t, ds, testStacksNS2, "apps", "dashboards", "dash2", 101, nil)
- require.NoError(t, pds.Upsert(t.Context(), "tenant-1", PendingDeleteRecord{
+ require.NoError(t, pds.Upsert(t.Context(), testStacksNS1, PendingDeleteRecord{
DeleteAfter: pastTime(),
}))
- require.NoError(t, pds.Upsert(t.Context(), "tenant-2", PendingDeleteRecord{
+ require.NoError(t, pds.Upsert(t.Context(), testStacksNS2, PendingDeleteRecord{
DeleteAfter: futureTime(),
}))
td.runDeletionPass(t.Context())
- // tenant-1 data gone.
- listKey1 := ListRequestKey{Group: "apps", Resource: "dashboards", Namespace: "tenant-1"}
+ // First tenant's data gone.
+ listKey1 := ListRequestKey{Group: "apps", Resource: "dashboards", Namespace: testStacksNS1}
prefix1 := listKey1.Prefix()
var count1 int
for _, err := range ds.kv.Keys(t.Context(), dataSection, ListOptions{
@@ -168,10 +199,10 @@ func TestRunDeletionPass_PreservesOtherTenants(t *testing.T) {
require.NoError(t, err)
count1++
}
- assert.Equal(t, 0, count1, "tenant-1 resources should have been deleted")
+ assert.Equal(t, 0, count1, "first tenant resources should have been deleted")
- // tenant-2 data intact.
- listKey2 := ListRequestKey{Group: "apps", Resource: "dashboards", Namespace: "tenant-2"}
+ // Second tenant's data intact.
+ listKey2 := ListRequestKey{Group: "apps", Resource: "dashboards", Namespace: testStacksNS2}
prefix2 := listKey2.Prefix()
var count2 int
for _, err := range ds.kv.Keys(t.Context(), dataSection, ListOptions{
@@ -181,13 +212,13 @@ func TestRunDeletionPass_PreservesOtherTenants(t *testing.T) {
require.NoError(t, err)
count2++
}
- assert.Equal(t, 1, count2, "tenant-2 resources should remain")
+ assert.Equal(t, 1, count2, "second tenant resources should remain")
- // tenant-1 pending delete record gone; tenant-2 record still exists.
- _, err := pds.Get(t.Context(), "tenant-1")
+ // First tenant's pending delete record gone; second tenant's record still exists.
+ _, err := pds.Get(t.Context(), testStacksNS1)
assert.ErrorIs(t, err, ErrNotFound)
- _, err = pds.Get(t.Context(), "tenant-2")
+ _, err = pds.Get(t.Context(), testStacksNS2)
assert.NoError(t, err)
}
@@ -196,16 +227,16 @@ func TestRunDeletionPass_PreservesOtherTenants(t *testing.T) {
func TestRunDeletionPass_DryRun(t *testing.T) {
td, ds, pds := newTestTenantDeleter(t, true /* dryRun */)
- saveTestResource(t, ds, "tenant-1", "apps", "dashboards", "dash1", 100, nil)
+ saveTestResource(t, ds, testStacksNS1, "apps", "dashboards", "dash1", 100, nil)
- require.NoError(t, pds.Upsert(t.Context(), "tenant-1", PendingDeleteRecord{
+ require.NoError(t, pds.Upsert(t.Context(), testStacksNS1, PendingDeleteRecord{
DeleteAfter: pastTime(),
}))
td.runDeletionPass(t.Context())
// Data should still be present.
- listKey := ListRequestKey{Group: "apps", Resource: "dashboards", Namespace: "tenant-1"}
+ listKey := ListRequestKey{Group: "apps", Resource: "dashboards", Namespace: testStacksNS1}
prefix := listKey.Prefix()
var count int
for _, err := range ds.kv.Keys(t.Context(), dataSection, ListOptions{
@@ -218,7 +249,7 @@ func TestRunDeletionPass_DryRun(t *testing.T) {
assert.Equal(t, 1, count, "data should not be deleted in dry-run mode")
// Pending delete record should still exist.
- _, err := pds.Get(t.Context(), "tenant-1")
+ _, err := pds.Get(t.Context(), testStacksNS1)
assert.NoError(t, err)
}
@@ -228,14 +259,14 @@ func TestRunDeletionPass_NoDataCleansUpRecord(t *testing.T) {
td, _, pds := newTestTenantDeleter(t, false)
// No resources saved — only a pending-delete record.
- require.NoError(t, pds.Upsert(t.Context(), "tenant-1", PendingDeleteRecord{
+ require.NoError(t, pds.Upsert(t.Context(), testStacksNS1, PendingDeleteRecord{
DeleteAfter: pastTime(),
}))
td.runDeletionPass(t.Context())
// Pending delete record should be removed.
- _, err := pds.Get(t.Context(), "tenant-1")
+ _, err := pds.Get(t.Context(), testStacksNS1)
assert.ErrorIs(t, err, ErrNotFound)
}
@@ -244,24 +275,24 @@ func TestRunDeletionPass_NoDataCleansUpRecord(t *testing.T) {
func TestRunDeletionPass_IdempotentRerun(t *testing.T) {
td, ds, pds := newTestTenantDeleter(t, false)
- saveTestResource(t, ds, "tenant-1", "apps", "dashboards", "dash1", 100, nil)
- saveTestResource(t, ds, "tenant-1", "apps", "dashboards", "dash2", 101, nil)
+ saveTestResource(t, ds, testStacksNS1, "apps", "dashboards", "dash1", 100, nil)
+ saveTestResource(t, ds, testStacksNS1, "apps", "dashboards", "dash2", 101, nil)
- require.NoError(t, pds.Upsert(t.Context(), "tenant-1", PendingDeleteRecord{
+ require.NoError(t, pds.Upsert(t.Context(), testStacksNS1, PendingDeleteRecord{
DeleteAfter: pastTime(),
}))
// First pass deletes everything.
td.runDeletionPass(t.Context())
- _, err := pds.Get(t.Context(), "tenant-1")
+ _, err := pds.Get(t.Context(), testStacksNS1)
require.ErrorIs(t, err, ErrNotFound)
// Second pass should be a no-op (no record, no data).
td.runDeletionPass(t.Context())
// Verify data is still gone.
- listKey := ListRequestKey{Group: "apps", Resource: "dashboards", Namespace: "tenant-1"}
+ listKey := ListRequestKey{Group: "apps", Resource: "dashboards", Namespace: testStacksNS1}
prefix := listKey.Prefix()
var count int
for _, err := range ds.kv.Keys(t.Context(), dataSection, ListOptions{
@@ -289,14 +320,22 @@ func TestRunDeletionPass_IdempotentAfterPartialFailure(t *testing.T) {
DryRun: false,
Interval: time.Hour,
Log: log.NewNopLogger(),
+ Gcom: &testGcomVerifier{
+ getInstance: func(ctx context.Context, requestID, instanceID string) (gcom.Instance, error) {
+ if instanceID == "1" {
+ return gcom.Instance{ID: 1, Slug: "test", Status: "deleted"}, nil
+ }
+ return gcom.Instance{}, fmt.Errorf("instance not found")
+ },
+ },
}
t
... [truncated]