Alert Rules & Action Groups

Overview

Azure Monitor Alerts proactively notify you when conditions in your telemetry indicate a problem. Combined with Action Groups, they form a complete incident notification pipeline — from detection to response.

Alert Types

| Type | Signal Source | Use Case |
|---|---|---|
| Metric Alert | Platform metrics, custom metrics | CPU > 80%, response time > 2s |
| Log Alert | Log Analytics / KQL query | Error count spike, specific exception |
| Activity Log Alert | Azure control plane | Resource deleted, deployment failed |
| Smart Detection | Application Insights ML | Anomaly in failure rate, response time |

Metric Alerts

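Metric alerts evaluate platform or custom metrics at regular intervals. Each evaluation aggregates the metric over a look-back window (the window size) and compares the result with the threshold: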

Azure CLI — Create a Metric Alert

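# CpuPercentage is emitted by the App Service plan (Microsoft.Web/serverfarms),
# not by the web app itself, so the scope targets the plan ("myPlan" is a placeholder).
az monitor metrics alert create \
  --name "HighCPU-AppService" \
  --resource-group myRG \
  --scopes "/subscriptions/{sub}/resourceGroups/myRG/providers/Microsoft.Web/serverfarms/myPlan" \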
  --condition "avg CpuPercentage > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 2 \
  --action "/subscriptions/{sub}/resourceGroups/myRG/providers/Microsoft.Insights/actionGroups/OpsTeam"

Bicep / ARM Template

resource cpuAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'HighCPU-AppService'
  location: 'global'
  properties: {
    severity: 2
    enabled: true
    scopes: [appServicePlan.id] // CpuPercentage is a plan-level (Microsoft.Web/serverfarms) metric
    evaluationFrequency: 'PT1M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion' // required for a static-threshold criterion
          name: 'CPUCheck'
          metricName: 'CpuPercentage'
          operator: 'GreaterThan'
          threshold: 80
          timeAggregation: 'Average'
        }
      ]
    }
    actions: [{ actionGroupId: actionGroup.id }]
  }
}
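
To make the interaction between the window size and the evaluation frequency concrete, here is a minimal Python sketch of the evaluation loop. It is an illustration only, not how Azure Monitor is implemented: the function name `evaluate_metric_alert` and the per-minute sample list are assumptions, and the sketch ignores alert state (a real rule fires once and later resolves rather than reporting every breaching evaluation).

```python
from statistics import mean


def evaluate_metric_alert(samples, window_size=5, frequency=1, threshold=80.0):
    """Return the minute marks at which the averaged window breaches the threshold.

    samples: one CPU percentage per minute (hypothetical data).
    Every `frequency` minutes, the last `window_size` samples are averaged
    and compared with `threshold` using a strict greater-than.
    """
    breaches = []
    for end in range(window_size, len(samples) + 1, frequency):
        window = samples[end - window_size:end]
        if mean(window) > threshold:
            breaches.append(end)
    return breaches


cpu = [50, 60, 95, 95, 95, 95, 95, 40, 40, 40]
print(evaluate_metric_alert(cpu))  # [6, 7, 8]
```

The first full window (minutes 1 to 5) averages 79, so it does not breach even though three of its samples are above 80. This is why a longer window dampens noise but delays detection.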

Log Alerts (KQL-Based)

Log alerts run a KQL query on a schedule and fire when results meet a threshold:

az monitor scheduled-query create \
  --name "HighErrorRate" \
  --resource-group myRG \
  --scopes "/subscriptions/{sub}/resourceGroups/myRG/providers/Microsoft.Insights/components/myAppInsights" \
  --condition "count 'FailedRequests' > 50" \
  --condition-query FailedRequests="requests | where success == false" \
  --window-size 5m \
  --evaluation-frequency 5m \
  --severity 1 \
  --action-groups "/subscriptions/{sub}/resourceGroups/myRG/providers/Microsoft.Insights/actionGroups/OpsTeam"

Common Log Alert Queries

Exception spike:

exceptions
| where timestamp > ago(5m)
| summarize exceptionCount = count() by type
| where exceptionCount > 10

Dependency failure rate:

dependencies
| where timestamp > ago(15m)
| summarize total = count(), failed = countif(success == false)
| extend failRate = (failed * 100.0) / total
| where failRate > 5
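
The dependency failure-rate query above boils down to a small piece of arithmetic. The following Python sketch mirrors it on an in-memory list so the threshold logic is easy to check by hand. The record shape (a dict with a boolean `success` key) and the function names are assumptions made for illustration; the explicit empty-input guard is an addition that the KQL version does not need to spell out.

```python
def dependency_fail_rate(calls):
    """Return the failure percentage, or None when there are no calls.

    calls: iterable of dicts with a boolean 'success' key (hypothetical shape).
    """
    total = 0
    failed = 0
    for call in calls:
        total += 1
        if call["success"] is False:
            failed += 1
    if total == 0:
        return None  # avoid dividing by zero when no dependency calls were made
    return failed * 100.0 / total


def should_alert(calls, threshold_pct=5.0):
    """Mirror the final `where failRate > 5` filter."""
    rate = dependency_fail_rate(calls)
    return rate is not None and rate > threshold_pct


calls = [{"success": True}] * 18 + [{"success": False}] * 2
print(dependency_fail_rate(calls))  # 10.0
print(should_alert(calls))          # True
```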

Action Groups

Action Groups define WHO gets notified and HOW:

Notification Types

| Type | Description |
|---|---|
| Email | Send to individual or distribution list |
| SMS | Text message to phone number |
| Voice | Automated phone call |
| Push | Azure mobile app notification |
| Azure Function | Trigger a function for auto-remediation |
| Logic App | Start a workflow (create ticket, post to Teams) |
| Webhook | POST to any HTTP endpoint |
| ITSM | ServiceNow, Provance integration |

Create an Action Group (CLI)

az monitor action-group create \
  --name "OpsTeam" \
  --resource-group myRG \
  --short-name "Ops" \
  --action email ops-lead ops@company.com \
  --action webhook pagerduty "https://events.pagerduty.com/integration/xxx/enqueue"

Bicep

resource actionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: 'OpsTeam'
  location: 'global'
  properties: {
    groupShortName: 'Ops'
    enabled: true
    emailReceivers: [
      { name: 'ops-lead', emailAddress: 'ops@company.com', useCommonAlertSchema: true }
    ]
    webhookReceivers: [
      { name: 'pagerduty', serviceUri: 'https://events.pagerduty.com/integration/xxx/enqueue', useCommonAlertSchema: true }
    ]
  }
}
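
On the receiving side, a webhook that has `useCommonAlertSchema` enabled gets a JSON body with a `schemaId` of `azureMonitorCommonAlertSchema` and an `essentials` object under `data`. The sketch below shows one way a receiver might extract the routing fields. The function name `summarize_alert` is hypothetical, and the sample payload is hand-written and heavily trimmed, so treat it as an approximation of the schema rather than a captured notification.

```python
import json


def summarize_alert(body: str) -> dict:
    """Extract the fields a receiver typically routes on."""
    payload = json.loads(body)
    if payload.get("schemaId") != "azureMonitorCommonAlertSchema":
        raise ValueError("not a Common Alert Schema payload")
    essentials = payload["data"]["essentials"]
    return {
        "rule": essentials["alertRule"],
        "severity": essentials["severity"],            # e.g. "Sev2"
        "condition": essentials["monitorCondition"],   # "Fired" or "Resolved"
        "targets": essentials.get("alertTargetIDs", []),
    }


# Hand-written, trimmed sample for illustration only.
sample = json.dumps({
    "schemaId": "azureMonitorCommonAlertSchema",
    "data": {
        "essentials": {
            "alertRule": "HighCPU-AppService",
            "severity": "Sev2",
            "monitorCondition": "Fired",
            "alertTargetIDs": ["/subscriptions/{sub}/resourcegroups/myrg"],
        }
    },
})

print(summarize_alert(sample)["condition"])  # Fired
```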

Alert Processing Rules

Alert processing rules let you suppress or route alerts based on schedule or scope:

  • Suppression — Silence alerts during maintenance windows
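  • Apply action groups — Add an action group to matching alerts, for example to route weekend alerts to the on-call team

The following rule removes all action groups from alerts fired in the resource group every Sunday between 02:00 and 06:00 UTC:
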
az monitor alert-processing-rule create \
  --name "MaintenanceWindow" \
  --resource-group myRG \
  --scopes "/subscriptions/{sub}/resourceGroups/myRG" \
  --rule-type RemoveAllActionGroups \
  --schedule-recurrence-type Weekly \
  --schedule-recurrence Sunday \
  --schedule-recurrence-start-time "02:00:00" \
  --schedule-recurrence-end-time "06:00:00" \
  --schedule-time-zone "UTC"
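
The weekly recurrence in this rule can be reasoned about as a simple time check. The Python sketch below tests whether a timestamp falls inside a Sunday 02:00 to 06:00 UTC window. It is an illustration under assumptions: the function name `in_weekly_window` is hypothetical, the end of the window is treated as exclusive, and windows that cross midnight are not handled.

```python
from datetime import datetime, time, timezone


def in_weekly_window(ts, weekday=6, start=time(2, 0), end=time(6, 0)):
    """Return True if a timezone-aware timestamp is inside the weekly UTC window.

    weekday follows Python's convention: Monday is 0, Sunday is 6.
    The end time is treated as exclusive (an assumption of this sketch).
    """
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    utc = ts.astimezone(timezone.utc)
    return utc.weekday() == weekday and start <= utc.time() < end


sunday_night = datetime(2024, 1, 7, 3, 0, tzinfo=timezone.utc)   # a Sunday
monday_night = datetime(2024, 1, 8, 3, 0, tzinfo=timezone.utc)   # a Monday
print(in_weekly_window(sunday_night))  # True
print(in_weekly_window(monday_night))  # False
```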

Best Practices

  1. Use severity levels consistently — Sev 0 = critical (page), Sev 1 = error, Sev 2 = warning (email), Sev 3 = informational, Sev 4 = verbose
  2. Avoid alert fatigue — Only page for actionable alerts; use email/Teams for warnings
  3. Set appropriate window sizes — Too short = noisy; too long = slow detection
  4. Use dynamic thresholds — ML-based thresholds adapt to patterns automatically
  5. Test action groups — Use the "Test" button in the portal to verify notifications work
  6. Use Common Alert Schema — Standardizes payload format across all alert types
  7. Document runbooks — Link each alert to a runbook describing remediation steps
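
One way to keep severity handling consistent is to encode the routing policy in a single place. The sketch below is a hypothetical policy table in Python, not an Azure feature: the channel names are made up, and the choices for Sev 1, Sev 3, and Sev 4 are assumptions, since the list above only states that Sev 0 pages and Sev 2 goes to email.

```python
# Hypothetical routing policy: Azure Monitor severity (0-4) to channel.
SEVERITY_ROUTES = {
    0: "page",    # critical
    1: "page",    # error (assumed to page as well)
    2: "email",   # warning
    3: "email",   # informational (assumption)
    4: "none",    # verbose (assumption: no notification)
}


def channel_for(severity: int) -> str:
    """Look up the notification channel for a severity level."""
    if severity not in SEVERITY_ROUTES:
        raise ValueError(f"unknown severity: {severity}")
    return SEVERITY_ROUTES[severity]


print(channel_for(0))  # page
print(channel_for(2))  # email
```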

Key Takeaways

  • Metric alerts are best for infrastructure signals; log alerts for application-level conditions
  • Action Groups decouple "what to detect" from "who to notify"
  • Use alert processing rules for maintenance windows and routing
  • Dynamic thresholds reduce manual threshold tuning
  • Every alert should have a clear owner and remediation runbook