Azure Monitor — Enterprise Alerting Strategy
Building Effective Alert Systems
Introduction
Effective alerting is the bridge between observing system behavior and taking action. Too few alerts mean missed problems; too many create noise and alert fatigue. For Azure integration workloads spanning Functions, Service Bus, API Management, and more, an enterprise alerting strategy ensures the right people are notified at the right time about the right issues.
This guide covers:
- Alert types — Metrics, logs, and activity
- Alert design — Building effective alerts
- Routing — Getting alerts to the right people
- Automation — Auto-remediation
- Optimization — Reducing alert fatigue
Alert Types
Azure Monitor Alert Types
┌─────────────────────────────────────────────────────────────────────┐
│ ALERT TYPES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ METRIC ALERTS │
│ ───────────── │
│ Triggered when metric crosses threshold │
│ Examples: CPU > 80%, Queue depth > 1000, Errors > 10 │
│ Use for: Real-time performance issues │
│ Response: Immediate action required │
│ │
│ LOG ALERTS │
│ ────────── │
│ Triggered when log query matches criteria │
│ Examples: Error count in last hour, Exception pattern │
│ Use for: Application-level issues │
│ Response: Investigation required │
│ │
│ ACTIVITY LOG ALERTS │
│ ─────────────────── │
│ Triggered on Azure resource operations │
│ Examples: New resource created, Config changed, ServiceHealth │
│ Use for: Audit and governance │
│ Response: May require follow-up │
│ │
│ SMART DETECTION │
│ ──────────────── │
│ ML-based anomaly detection │
│ Examples: Unusual traffic patterns, Failure patterns │
│ Use for: Proactive issue detection │
│ Response: Review and investigate │
│ │
│ SERVICE HEALTH │
│ ───────────── │
│ Azure service issues affecting resources │
│ Examples: Service degradation, Planned maintenance │
│ Response: Monitor and communicate │
│ │
└─────────────────────────────────────────────────────────────────────┘
Alert Configuration
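# Create a metric alert for Service Bus queue depth.
# --scopes is required; <namespace-name> is a placeholder for your Service Bus namespace.
# Without a dimension filter, ActiveMessages is evaluated across the whole namespace, not per queue.
az monitor metrics alert create \
  --name "servicebus-queue-depth" \
  --resource-group rg-integration \
  --scopes /subscriptions/xxx/resourceGroups/rg-integration/providers/Microsoft.ServiceBus/namespaces/<namespace-name> \
  --condition "avg ActiveMessages > 1000" \
  --description "Service Bus queue depth exceeded threshold" \
  --evaluation-frequency 5m \
  --window-size 10m \
  --action /subscriptions/xxx/resourceGroups/rg-monitoring/providers/Microsoft.Insights/actionGroups/platform-alerts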
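# Create a log alert for function failures.
# Log alert rules are created with "az monitor scheduled-query create" (scheduled-query CLI extension);
# check "az monitor scheduled-query create --help" for the flags supported by your CLI version.
# <app-insights-resource-id> is a placeholder. The look-back period comes from --window-size, not the query.
az monitor scheduled-query create \
  --name "function-failures" \
  --resource-group rg-integration \
  --scopes <app-insights-resource-id> \
  --condition "count 'FailedRequests' > 10" \
  --condition-query FailedRequests="requests | where success == false" \
  --window-size 1h \
  --description "Function failure count exceeded" \
  --action-groups /subscriptions/xxx/resourceGroups/rg-monitoring/providers/Microsoft.Insights/actionGroups/platform-alerts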
Alert Design Principles
Designing Effective Alerts
{
"alertDesign": {
"requirements": [
"Actionable: Recipient knows what to do",
"Relevant: Actually indicates a problem",
"Timely: Detected quickly enough to matter",
"Unique: Not duplicating other alerts",
"Clear: Description explains the issue"
],
"components": {
"condition": "What triggers the alert",
"severity": "How urgent is this?",
"description": "What does this mean?",
"action": "What should be done?",
"runbook": "Where to find resolution steps"
}
}
}
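The component list above can be enforced mechanically before an alert rule is deployed. The following is a minimal sketch, assuming a hypothetical `AlertDefinition` record and `AlertDesignCheck` helper (neither is an Azure type); it reports which of the five components are missing.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Sample definition with all five components filled in.
var complete = new AlertDefinition(
    Condition: "avg ActiveMessages > 1000",
    Severity: "Sev2_High",
    Description: "Service Bus queue depth exceeded threshold",
    Action: "Scale consumers or check for stalled processing",
    Runbook: "servicebus-runbook.md");

// Same alert with two components stripped out.
var incomplete = complete with { Action = null, Runbook = "" };

Console.WriteLine(AlertDesignCheck.MissingComponents(complete).Count);                // prints 0
Console.WriteLine(string.Join(", ", AlertDesignCheck.MissingComponents(incomplete))); // prints action, runbook

// Hypothetical shape mirroring the "components" object above.
public record AlertDefinition(
    string? Condition, string? Severity, string? Description, string? Action, string? Runbook);

public static class AlertDesignCheck
{
    // Returns the names of components that are null, empty, or whitespace.
    public static List<string> MissingComponents(AlertDefinition alert)
    {
        var components = new (string Name, string? Value)[]
        {
            ("condition", alert.Condition),
            ("severity", alert.Severity),
            ("description", alert.Description),
            ("action", alert.Action),
            ("runbook", alert.Runbook),
        };

        var missing = new List<string>();
        foreach (var (name, value) in components)
        {
            if (string.IsNullOrWhiteSpace(value)) missing.Add(name);
        }
        return missing;
    }
}
```

A check like this could run in a deployment pipeline so that an alert without a runbook or a stated action never reaches production.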
Severity Levels
{
"severityLevels": {
"Sev1_Critical": {
"description": "Complete service outage",
"examples": [
"Production API completely unavailable",
"All messages failing to process",
"Data loss or corruption"
],
"response": "Immediate page, war room",
"SLA": "15 minutes"
},
"Sev2_High": {
"description": "Major impact, degraded service",
"examples": [
"High error rate (>10%)",
"Significant latency increase",
"Single region failure"
],
"response": "Page within 30 minutes",
"SLA": "1 hour"
},
"Sev3_Medium": {
"description": "Minor impact, needs attention",
"examples": [
"Elevated error rate",
"Non-critical function failing",
"Approaching capacity limits"
],
"response": "Email during business hours",
"SLA": "4 hours"
},
"Sev4_Low": {
"description": "Informational, no immediate impact",
"examples": [
"Threshold warnings",
"Configuration changes",
"Health check failures"
],
"response": "Dashboard entry or ticket",
"SLA": "Next business day"
}
}
}
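The response targets in the severity table can be checked against real acknowledgement times. This is a rough sketch, assuming the "SLA" values above are acknowledgement targets and approximating "Next business day" as 24 hours; the `IsTargetMissed` helper is illustrative only.

```csharp
using System;
using System.Collections.Generic;

// Targets copied from the "SLA" values above.
// Assumption: Sev4's "Next business day" is approximated as 24 hours.
var responseTargets = new Dictionary<string, TimeSpan>
{
    ["Sev1_Critical"] = TimeSpan.FromMinutes(15),
    ["Sev2_High"] = TimeSpan.FromHours(1),
    ["Sev3_Medium"] = TimeSpan.FromHours(4),
    ["Sev4_Low"] = TimeSpan.FromHours(24),
};

// True when the alert was acknowledged later than its severity allows.
bool IsTargetMissed(string severity, DateTimeOffset firedAt, DateTimeOffset acknowledgedAt) =>
    acknowledgedAt - firedAt > responseTargets[severity];

var firedAt = new DateTimeOffset(2024, 1, 15, 9, 0, 0, TimeSpan.Zero);

// Acknowledged after 20 minutes: too slow for Sev1, fine for Sev2.
Console.WriteLine(IsTargetMissed("Sev1_Critical", firedAt, firedAt.AddMinutes(20))); // prints True
Console.WriteLine(IsTargetMissed("Sev2_High", firedAt, firedAt.AddMinutes(20)));     // prints False
```

Note that Azure Monitor itself reports severities as Sev0 to Sev4, so a mapping from the platform values to the four levels used here would be needed in practice.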
Action Groups and Routing
Action Group Configuration
{
"actionGroups": {
"platform-oncall": {
"type": "Email",
"recipients": ["oncall-platform@company.com"],
"useCommonAlertSchema": true,
"enabled": true
},
"security-team": {
"type": "Email",
"recipients": ["security@company.com"],
"webhook": "https://company.com/api/security-alerts"
},
"urgent-page": {
"type": "SMS",
"recipients": ["+15551234567"],
"enabled": true
},
"auto-remediation": {
"type": "Webhook",
"url": "https://company.com/api/remediate"
}
}
}
Alert Routing Logic
{
"routingRules": {
"serviceBus_alerts": {
"condition": "resourceType == Microsoft.ServiceBus",
"actions": ["platform-oncall", "integration-team"],
"runbook": "servicebus-runbook.md"
},
"function_critical": {
"condition": "metric == Exceptions AND value > 50",
"actions": ["urgent-page", "platform-oncall"],
"runbook": "function-runbook.md"
},
"apiManagement_health": {
"condition": "resourceType == Microsoft.ApiManagement",
"actions": ["api-team"],
"runbook": "apim-runbook.md"
}
}
}
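The routing rules above are declarative; one way to evaluate them in code is sketched below. The `AlertSignal` and `RoutingRule` records and the `Route` function are hypothetical names introduced for illustration, and the conditions are hand-translated from the rule strings rather than parsed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The three rules from the routing configuration, with conditions expressed as predicates.
var rules = new List<RoutingRule>
{
    new("serviceBus_alerts",
        s => s.ResourceType == "Microsoft.ServiceBus",
        new[] { "platform-oncall", "integration-team" },
        "servicebus-runbook.md"),
    new("function_critical",
        s => s.Metric == "Exceptions" && s.Value > 50,
        new[] { "urgent-page", "platform-oncall" },
        "function-runbook.md"),
    new("apiManagement_health",
        s => s.ResourceType == "Microsoft.ApiManagement",
        new[] { "api-team" },
        "apim-runbook.md"),
};

// Collect the action groups of every matching rule, without duplicates.
IReadOnlyList<string> Route(AlertSignal signal) =>
    rules.Where(r => r.Matches(signal))
         .SelectMany(r => r.ActionGroups)
         .Distinct()
         .ToList();

var sample = new AlertSignal("Microsoft.Web", "Exceptions", 75);
Console.WriteLine(string.Join(", ", Route(sample))); // prints urgent-page, platform-oncall

public record AlertSignal(string ResourceType, string Metric, double Value);

public record RoutingRule(
    string Name, Func<AlertSignal, bool> Matches, string[] ActionGroups, string Runbook);
```

Because several rules can match the same alert, the sketch deduplicates action groups so that a team listed in two rules is notified once.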
Auto-Remediation
Alert-Based Automation
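using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.Functions.Worker;

// Isolated worker model with ASP.NET Core integration.
// MonitorAlert and the helper methods not shown here are application-defined.
public class AlertAutomation
{
    [Function("AlertWebhook")]
    public async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req)
    {
        var alert = await req.ReadFromJsonAsync<MonitorAlert>();
        if (alert is null)
        {
            return new BadRequestResult();
        }

        switch (alert.AlertType)
        {
            case "QueueDepthExceeded":
                await HandleQueueDepthAsync(alert);
                break;
            case "FunctionFailureSpike":
                await HandleFunctionFailuresAsync(alert);
                break;
            case "StorageCapacityWarning":
                await HandleStorageCapacityAsync(alert);
                break;
            default:
                // Unrecognized alert types go to a person instead of being dropped.
                await NotifyTeamAsync(alert);
                break;
        }

        return new OkResult();
    }

    private async Task HandleQueueDepthAsync(MonitorAlert alert)
    {
        // Check if we can process more (capacity is assumed to be a utilization percentage)
        var currentCapacity = await GetConsumerCapacityAsync();
        if (currentCapacity < 80)
        {
            // Headroom available: scale up consumers
            await ScaleUpConsumerAsync(2);
        }
        else
        {
            // Scaling will not absorb the backlog: notify the team
            await NotifyTeamAsync(alert);
        }
    }
}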
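When an action group has the common alert schema enabled, the webhook receives a JSON envelope whose `data.essentials` object carries the rule name, severity, and monitor condition. The sketch below parses a trimmed, hand-written sample payload with `System.Text.Json`; the sample values are made up, real payloads contain more fields, and the raw string literal assumes C# 11 or later.

```csharp
using System;
using System.Text.Json;

// Trimmed sample payload (illustrative values only).
const string payload = """
{
  "schemaId": "azureMonitorCommonAlertSchema",
  "data": {
    "essentials": {
      "alertRule": "servicebus-queue-depth",
      "severity": "Sev2",
      "signalType": "Metric",
      "monitorCondition": "Fired"
    }
  }
}
""";

using var doc = JsonDocument.Parse(payload);
var essentials = doc.RootElement.GetProperty("data").GetProperty("essentials");

string rule = essentials.GetProperty("alertRule").GetString() ?? "";
string severity = essentials.GetProperty("severity").GetString() ?? "";
string condition = essentials.GetProperty("monitorCondition").GetString() ?? "";

// Only newly fired alerts should trigger remediation; "Resolved" notifications should not.
bool shouldRemediate = condition == "Fired";

Console.WriteLine($"{rule} ({severity}): remediate={shouldRemediate}");
// prints servicebus-queue-depth (Sev2): remediate=True
```

Checking `monitorCondition` before acting is one simple guard against a remediation step running a second time when the alert resolves.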
Logic App Alert Handler
{
"logicAppAlertHandler": {
"trigger": {
"type": "When a HTTP request is received",
"schema": "Alert schema"
},
"actions": [
{
"condition": "Severity == Critical",
"actions": [
{
"type": "Send email",
"to": "oncall@company.com",
"subject": "CRITICAL: {{alert.name}}"
},
{
"type": "Create incident",
"system": "pagerduty"
}
]
},
{
"condition": "Severity == Warning",
"actions": [
{
"type": "Send Teams message",
"channel": "integration-alerts"
}
]
}
]
}
}
Optimization and Tuning
Reducing Alert Fatigue
┌─────────────────────────────────────────────────────────────────────┐
│ ALERT OPTIMIZATION │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ COMMON PROBLEMS            SOLUTIONS │
│ ───────────────────────────────────────────────────────────────── │
│ Too many warnings          Raise thresholds │
│ Duplicate alerts           Group by resource/service │
│ Noise from flapping        Use longer evaluation windows │
│ Not actionable             Improve condition definition │
│ Wrong people notified      Update action groups │
│ No runbooks                Create and link runbooks │
│ │
│ ───────────────────────────────────────────────────────────────── │
│ │
│ METRICS TO TRACK: │
│ • Alert volume over time │
│ • % of actionable alerts │
│ • Time to acknowledge │
│ • Time to resolve │
│ • Alert storms │
│ │
└─────────────────────────────────────────────────────────────────────┘
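The metrics listed above can be computed from an alert history export. The following is a minimal sketch over in-memory sample data; the `AlertRecord` shape is an assumption, and in practice the records would come from your incident or alert management system.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var start = new DateTimeOffset(2024, 1, 15, 9, 0, 0, TimeSpan.Zero);

// Four sample alerts: fired, acknowledged, resolved, and whether action was needed.
var history = new List<AlertRecord>
{
    new(start, start.AddMinutes(5), start.AddMinutes(35), RequiredAction: true),
    new(start.AddHours(1), start.AddHours(1).AddMinutes(15), start.AddHours(1).AddMinutes(25), RequiredAction: false),
    new(start.AddHours(2), start.AddHours(2).AddMinutes(10), start.AddHours(2).AddMinutes(70), RequiredAction: true),
    new(start.AddHours(3), start.AddHours(3).AddMinutes(10), start.AddHours(3).AddMinutes(30), RequiredAction: false),
};

double actionableShare = (double)history.Count(a => a.RequiredAction) / history.Count;
double meanMinutesToAck = history.Average(a => (a.AcknowledgedAt - a.FiredAt).TotalMinutes);
double meanMinutesToResolve = history.Average(a => (a.ResolvedAt - a.FiredAt).TotalMinutes);

Console.WriteLine($"Alert volume: {history.Count}");                           // prints Alert volume: 4
Console.WriteLine($"Actionable: {actionableShare * 100:F0}%");                 // prints Actionable: 50%
Console.WriteLine($"Mean time to acknowledge: {meanMinutesToAck:F0} min");     // prints Mean time to acknowledge: 10 min
Console.WriteLine($"Mean time to resolve: {meanMinutesToResolve:F0} min");     // prints Mean time to resolve: 40 min

public record AlertRecord(
    DateTimeOffset FiredAt, DateTimeOffset AcknowledgedAt, DateTimeOffset ResolvedAt, bool RequiredAction);
```

Tracking these numbers per alert rule, rather than only in aggregate, makes it easier to see which rules are the main sources of noise.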
Alert Tuning Process
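using System.Linq;
using System.Threading.Tasks;

public class AlertTuner
{
    public async Task<TuningRecommendation> AnalyzeAlertAsync(
        string alertName,
        TimeRange period)
    {
        // Materialize once so the history is not enumerated repeatedly.
        var alerts = (await GetAlertHistoryAsync(alertName, period)).ToList();
        var recommendation = new TuningRecommendation();

        if (alerts.Count == 0)
        {
            // No history: skip the noise-rate division (0/0 would yield NaN).
            recommendation.Action = "No change";
            recommendation.Reason = "No alerts fired in the period";
            return recommendation;
        }

        var actionable = alerts.Count(a => a.RequiredAction);
        var noise = alerts.Count - actionable;
        var noiseRate = (double)noise / alerts.Count;

        if (noiseRate > 0.5)
        {
            recommendation.Action = "Increase threshold or use longer window";
            recommendation.Reason = $"{noiseRate:P0} are noise";
        }
        else if (noiseRate > 0.2)
        {
            recommendation.Action = "Review condition - consider grouping";
            recommendation.Reason = $"{noiseRate:P0} need refinement";
        }
        else
        {
            // Previously this case returned an empty recommendation.
            recommendation.Action = "No change";
            recommendation.Reason = $"{noiseRate:P0} noise is acceptable";
        }

        return recommendation;
    }
}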
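When the tuner recommends grouping, one simple approach is to collapse alerts for the same resource and rule that fire within the same fixed time bucket. The sketch below assumes a hypothetical `FiredAlert` record and a 10-minute bucket; both are illustrative choices, not Azure Monitor settings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var start = new DateTimeOffset(2024, 1, 15, 9, 0, 0, TimeSpan.Zero);

// Three repeats of the same queue alert plus one unrelated function alert.
var firedAlerts = new List<FiredAlert>
{
    new("sb-orders", "servicebus-queue-depth", start),
    new("sb-orders", "servicebus-queue-depth", start.AddMinutes(2)),
    new("sb-orders", "servicebus-queue-depth", start.AddMinutes(4)),
    new("func-billing", "function-failures", start.AddMinutes(3)),
};

var bucketSize = TimeSpan.FromMinutes(10);

// One notification per (resource, rule, time bucket); Count is the number of raw alerts collapsed.
var notifications = firedAlerts
    .GroupBy(a => (a.Resource, a.Rule, Bucket: a.FiredAt.UtcTicks / bucketSize.Ticks))
    .Select(g => new { g.Key.Resource, g.Key.Rule, Count = g.Count() })
    .ToList();

foreach (var item in notifications)
{
    Console.WriteLine($"{item.Resource}/{item.Rule}: {item.Count} alert(s)");
}
// prints:
// sb-orders/servicebus-queue-depth: 3 alert(s)
// func-billing/function-failures: 1 alert(s)

public record FiredAlert(string Resource, string Rule, DateTimeOffset FiredAt);
```

Fixed buckets are easy to reason about but split alerts that straddle a bucket boundary; a sliding window would avoid that at the cost of more bookkeeping.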
Best Practices
Implementation Checklist
| Practice | Description |
|---|---|
| Define severity clearly | Know what each level means |
| Create runbooks | Document response procedures |
| Test alerts | Verify alerts fire correctly |
| Review regularly | Tune based on actual behavior |
| Use action groups | Group by team/response |
| Enable auto-remediation | Where safe and appropriate |
Common Mistakes
┌─────────────────────────────────────────────────────────────────────┐
│ ALERT MISTAKES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ✗ Alerting on everything │
│ ✗ No severity levels │
│ ✗ Ignoring alert fatigue │
│ ✗ No runbooks │
│ ✗ Wrong notification recipients │
│ ✗ Noisy conditions (too sensitive) │
│ │
└─────────────────────────────────────────────────────────────────────┘
Related Topics
- SLI/SLO/SLA — Service levels
- Distributed Tracing — Observability
- Log Analytics — Logging
Azure Integration Hub - Architect Level Observability & Operations at Scale