Azure Monitor — Enterprise Alerting Strategy

Building Effective Alert Systems


Introduction

Effective alerting is the bridge between observing system behavior and taking action. Too few alerts mean missed problems; too many create noise and alert fatigue. For Azure integration workloads spanning Functions, Service Bus, API Management, and more, an enterprise alerting strategy ensures the right people are notified at the right time about the right issues.

This guide covers:

  • Alert types — Metrics, logs, and activity
  • Alert design — Building effective alerts
  • Routing — Getting alerts to the right people
  • Automation — Auto-remediation
  • Optimization — Reducing alert fatigue

Alert Types

Azure Monitor Alert Types

┌─────────────────────────────────────────────────────────────────────┐
│                  ALERT TYPES                                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   METRIC ALERTS                                                     │
│   ─────────────                                                     │
│   Triggered when metric crosses threshold                           │
│   Examples: CPU > 80%, Queue depth > 1000, Errors > 10              │
│   Use for: Real-time performance issues                             │
│   Response: Immediate action required                               │
│                                                                     │
│   LOG ALERTS                                                        │
│   ──────────                                                        │
│   Triggered when log query matches criteria                         │
│   Examples: Error count in last hour, Exception pattern             │
│   Use for: Application-level issues                                 │
│   Response: Investigation required                                  │
│                                                                     │
│   ACTIVITY LOG ALERTS                                               │
│   ───────────────────                                               │
│   Triggered on Azure resource operations                            │
│   Examples: New resource created, Config changed, Service Health    │
│   Use for: Audit and governance                                     │
│   Response: May require follow-up                                   │
│                                                                     │
│   SMART DETECTION                                                   │
│   ────────────────                                                  │
│   ML-based anomaly detection                                        │
│   Examples: Unusual traffic patterns, Failure patterns              │
│   Use for: Proactive issue detection                                │
│   Response: Review and investigate                                  │
│                                                                     │
│   SERVICE HEALTH                                                    │
│   ──────────────                                                    │
│   Azure service issues affecting resources                          │
│   Examples: Service degradation, Planned maintenance                │
│   Response: Monitor and communicate                                 │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Alert Configuration

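# Create metric alert for Service Bus queue depth
# --scopes is required; <namespace-name> is a placeholder for your Service Bus namespace
az monitor metrics alert create \
  --name "servicebus-queue-depth" \
  --resource-group rg-integration \
  --scopes "/subscriptions/xxx/resourceGroups/rg-integration/providers/Microsoft.ServiceBus/namespaces/<namespace-name>" \
  --condition "avg ActiveMessages > 1000" \
  --description "Service Bus queue depth exceeded threshold" \
  --evaluation-frequency 5m \
  --window-size 15m \
  --action /subscriptions/xxx/resourceGroups/rg-monitoring/providers/Microsoft.Insights/actionGroups/platform-alerts

# Create log alert for function failures
# Log alerts are created with "az monitor scheduled-query" (may require the scheduled-query CLI extension)
# <app-insights-name> is a placeholder for the Application Insights resource that holds the requests table
az monitor scheduled-query create \
  --name "function-failures" \
  --resource-group rg-integration \
  --scopes "/subscriptions/xxx/resourceGroups/rg-integration/providers/Microsoft.Insights/components/<app-insights-name>" \
  --condition "count 'FailedRequests' > 10" \
  --condition-query FailedRequests="requests | where success == false" \
  --window-size 1h \
  --description "Function failure count exceeded" \
  --action-groups /subscriptions/xxx/resourceGroups/rg-monitoring/providers/Microsoft.Insights/actionGroups/platform-alerts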
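
To make the interplay between --evaluation-frequency and --window-size concrete, here is a minimal Python sketch of how a static-threshold rule such as "avg ActiveMessages > 1000" could be evaluated. It is a simplified model under stated assumptions, not Azure Monitor's implementation: the function name, the one-reading-per-minute samples, and the 15-minute window are all hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def breaches_threshold(samples, now, window, threshold):
    """Return True when the average of the samples inside the look-back window exceeds threshold.

    samples: list of (timestamp, value) pairs, e.g. one ActiveMessages reading per minute.
    """
    in_window = [value for ts, value in samples if now - window < ts <= now]
    if not in_window:
        return False  # no data in the window: this sketch treats that as "not breached"
    return mean(in_window) > threshold


# Hypothetical readings: the queue sits at 400 messages, then jumps to 1600 at minute 20.
start = datetime(2024, 1, 1, 12, 0)
samples = [(start + timedelta(minutes=i), 400 if i < 20 else 1600) for i in range(31)]

window = timedelta(minutes=15)
# The rule is re-evaluated every 5 minutes (the evaluation frequency).
results = {
    minute: breaches_threshold(samples, start + timedelta(minutes=minute), window, 1000)
    for minute in (15, 20, 25, 30)
}
print(results)
```

In this model the rule first fires at minute 30, ten minutes after the jump, because the window average has to climb past the threshold. A longer window filters short spikes at the cost of slower detection.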

Alert Design Principles

Designing Effective Alerts

{
  "alertDesign": {
    "requirements": [
      "Actionable: Recipient knows what to do",
      "Relevant: Actually indicates a problem",
      "Timely: Detected quickly enough to matter",
      "Unique: Not duplicating other alerts",
      "Clear: Description explains the issue"
    ],
    "components": {
      "condition": "What triggers the alert",
      "severity": "How urgent is this?",
      "description": "What does this mean?",
      "action": "What should be done?",
      "runbook": "Where to find resolution steps"
    }
  }
}
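
One way to enforce the component list is a small lint check that runs before an alert definition is deployed. The sketch below is only an illustration: the dictionary layout and the function name are assumptions, not part of any Azure tooling.

```python
REQUIRED_COMPONENTS = ("condition", "severity", "description", "action", "runbook")


def missing_components(alert_definition):
    """Return the required components that are absent or blank in an alert definition."""
    missing = []
    for name in REQUIRED_COMPONENTS:
        value = alert_definition.get(name)
        # Compare against None explicitly so a numeric severity of 0 still counts as present.
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# Hypothetical definition: the runbook link was never filled in.
queue_alert = {
    "condition": "avg ActiveMessages > 1000",
    "severity": "Sev2_High",
    "description": "Service Bus queue depth exceeded threshold",
    "action": "platform-oncall",
    "runbook": "",
}

print(missing_components(queue_alert))
```

A check like this could run in a pull-request pipeline so that an alert without a runbook or an owner is caught before it reaches production.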

Severity Levels

{
  "severityLevels": {
    "Sev1_Critical": {
      "description": "Complete service outage",
      "examples": [
        "Production API completely unavailable",
        "All messages failing to process",
        "Data loss or corruption"
      ],
      "response": "Immediate page, war room",
      "SLA": "15 minutes"
    },
    "Sev2_High": {
      "description": "Major impact, degraded service",
      "examples": [
        "High error rate (>10%)",
        "Significant latency increase",
        "Single region failure"
      ],
      "response": "Page within 30 minutes",
      "SLA": "1 hour"
    },
    "Sev3_Medium": {
      "description": "Minor impact, needs attention",
      "examples": [
        "Elevated error rate",
        "Non-critical function failing",
        "Approaching capacity limits"
      ],
      "response": "Email during business hours",
      "SLA": "4 hours"
    },
    "Sev4_Low": {
      "description": "Informational, no immediate impact",
      "examples": [
        "Threshold warnings",
        "Configuration changes",
        "Health check failures"
      ],
      "response": "Dashboard ticket",
      "SLA": "Next business day"
    }
  }
}
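
The response targets above are easier to audit when they are encoded once and checked automatically. The following Python sketch assumes the SLA values from the severity definitions and uses hypothetical function names. "Next business day" is modelled as having no fixed deadline because it depends on a business calendar.

```python
from datetime import datetime, timedelta

# Response targets copied from the severity definitions; Sev4_Low has no fixed deadline here.
RESPONSE_SLA = {
    "Sev1_Critical": timedelta(minutes=15),
    "Sev2_High": timedelta(hours=1),
    "Sev3_Medium": timedelta(hours=4),
    "Sev4_Low": None,
}


def sla_deadline(severity, fired_at):
    """Return the time by which the alert should be handled, or None if no fixed deadline applies."""
    sla = RESPONSE_SLA[severity]
    return None if sla is None else fired_at + sla


def is_sla_breached(severity, fired_at, now):
    """Return True when a fixed deadline exists and has already passed."""
    deadline = sla_deadline(severity, fired_at)
    return deadline is not None and now > deadline


fired = datetime(2024, 1, 1, 9, 0)
print(sla_deadline("Sev1_Critical", fired))
print(is_sla_breached("Sev2_High", fired, datetime(2024, 1, 1, 10, 1)))
```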

Action Groups and Routing

Action Group Configuration

{
  "actionGroups": {
    "platform-oncall": {
      "type": "Email",
      "recipients": ["oncall-platform@company.com"],
      "useCommonAlertSchema": true,
      "enabled": true
    },
    "security-team": {
      "type": "Email",
      "recipients": ["security@company.com"],
      "webhook": "https://company.com/api/security-alerts"
    },
    "urgent-page": {
      "type": "SMS",
      "recipients": ["+15551234567"],
      "enabled": true
    },
    "auto-remediation": {
      "type": "Webhook",
      "url": "https://company.com/api/remediate"
    }
  }
}

Alert Routing Logic

{
  "routingRules": {
    "serviceBus_alerts": {
      "condition": "resourceType == Microsoft.ServiceBus",
      "actions": ["platform-oncall", "integration-team"],
      "runbook": "servicebus-runbook.md"
    },
    "function_critical": {
      "condition": "metric == Exceptions AND value > 50",
      "actions": ["urgent-page", "platform-oncall"],
      "runbook": "function-runbook.md"
    },
    "apiManagement_health": {
      "condition": "resourceType == Microsoft.ApiManagement",
      "actions": ["api-team"],
      "runbook": "apim-runbook.md"
    }
  }
}
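
The routing rules above are declarative. As a rough sketch of how they might be evaluated, the Python below expresses each condition as a predicate and collects the actions of every matching rule. The alert dictionary shape and the route_alert helper are assumptions made for illustration.

```python
def route_alert(alert, rules):
    """Return the de-duplicated actions and the runbooks of every rule that matches the alert."""
    actions, runbooks = [], []
    for rule in rules:
        if rule["matches"](alert):
            for action in rule["actions"]:
                if action not in actions:
                    actions.append(action)
            runbooks.append(rule["runbook"])
    return actions, runbooks


ROUTING_RULES = [
    {
        "name": "serviceBus_alerts",
        "matches": lambda a: a.get("resourceType") == "Microsoft.ServiceBus",
        "actions": ["platform-oncall", "integration-team"],
        "runbook": "servicebus-runbook.md",
    },
    {
        "name": "function_critical",
        "matches": lambda a: a.get("metric") == "Exceptions" and a.get("value", 0) > 50,
        "actions": ["urgent-page", "platform-oncall"],
        "runbook": "function-runbook.md",
    },
    {
        "name": "apiManagement_health",
        "matches": lambda a: a.get("resourceType") == "Microsoft.ApiManagement",
        "actions": ["api-team"],
        "runbook": "apim-runbook.md",
    },
]

# Hypothetical alert from a function app with 72 exceptions in the evaluation window.
alert = {"resourceType": "Microsoft.Web", "metric": "Exceptions", "value": 72}
actions, runbooks = route_alert(alert, ROUTING_RULES)
print(actions, runbooks)
```

An alert that matches no rule returns empty lists here. In practice it is worth sending those to a default owner so nothing is dropped silently.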

Auto-Remediation

Alert-Based Automation

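using System.Net;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;

// MonitorAlert and the Handle*, Get*, Scale*, and Notify* helpers are
// application-defined; only HandleQueueDepthAsync is shown here.
public class AlertAutomation
{
    [Function("AlertWebhook")]
    public async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req)
    {
        // The action group webhook POSTs the alert payload as JSON
        var alert = await req.ReadFromJsonAsync<MonitorAlert>();
        if (alert is null)
        {
            return req.CreateResponse(HttpStatusCode.BadRequest);
        }

        switch (alert.AlertType)
        {
            case "QueueDepthExceeded":
                await HandleQueueDepthAsync(alert);
                break;
            case "FunctionFailureSpike":
                await HandleFunctionFailuresAsync(alert);
                break;
            case "StorageCapacityWarning":
                await HandleStorageCapacityAsync(alert);
                break;
            default:
                // Unknown alert types go to a human instead of being dropped
                await NotifyTeamAsync(alert);
                break;
        }

        return req.CreateResponse(HttpStatusCode.OK);
    }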

    private async Task HandleQueueDepthAsync(MonitorAlert alert)
    {
        // Check if we can process more (capacity is assumed to be the percentage of maximum scale in use)
        var currentCapacity = await GetConsumerCapacityAsync();

        if (currentCapacity < 80)
        {
            // Scale up consumers
            await ScaleUpConsumerAsync(2);
        }
        else
        {
            // Alert is real - notify team
            await NotifyTeamAsync(alert);
        }
    }
}

Logic App Alert Handler

{
  "logicAppAlertHandler": {
    "trigger": {
      "type": "When a HTTP request is received",
      "schema": "Alert schema"
    },
    "actions": [
      {
        "condition": "Severity == Critical",
        "actions": [
          {
            "type": "Send email",
            "to": "oncall@company.com",
            "subject": "CRITICAL: {{alert.name}}"
          },
          {
            "type": "Create incident",
            "system": "pagerduty"
          }
        ]
      },
      {
        "condition": "Severity == Warning",
        "actions": [
          {
            "type": "Send Teams message",
            "channel": "integration-alerts"
          }
        ]
      }
    ]
  }
}
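
Both handlers above receive the alert as JSON. When the action group has the common alert schema enabled, the fields worth branching on sit under data.essentials, and severity arrives as "Sev0" through "Sev4" rather than as a name. The Python sketch below shows one way to normalize such a payload. The action names and the trimmed sample payload are hypothetical.

```python
# Azure Monitor severities as they appear in the payload, mapped to display names.
SEVERITY_NAMES = {
    "Sev0": "Critical",
    "Sev1": "Error",
    "Sev2": "Warning",
    "Sev3": "Informational",
    "Sev4": "Verbose",
}


def summarize_alert(payload):
    """Pull the fields a handler usually branches on out of a common alert schema payload."""
    essentials = payload["data"]["essentials"]
    return {
        "rule": essentials["alertRule"],
        "severity": SEVERITY_NAMES.get(essentials["severity"], "Unknown"),
        "state": essentials["monitorCondition"],  # "Fired" or "Resolved"
        "targets": essentials.get("alertTargetIDs", []),
    }


def choose_actions(summary):
    """Mirror the two branches of the Logic App handler; action names are hypothetical."""
    if summary["state"] != "Fired":
        return []  # resolved notifications do not open new incidents in this sketch
    if summary["severity"] == "Critical":
        return ["send-email", "create-incident"]
    if summary["severity"] == "Warning":
        return ["send-teams-message"]
    return []


# Trimmed example payload; real payloads carry more fields.
payload = {
    "schemaId": "azureMonitorCommonAlertSchema",
    "data": {
        "essentials": {
            "alertRule": "servicebus-queue-depth",
            "severity": "Sev0",
            "monitorCondition": "Fired",
            "alertTargetIDs": ["/subscriptions/xxx/resourcegroups/rg-integration"],
        }
    },
}

summary = summarize_alert(payload)
actions = choose_actions(summary)
print(summary["severity"], actions)
```

Checking monitorCondition first matters because the same webhook is also called when an alert resolves. Without that check a resolution would open a second incident.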

Optimization and Tuning

Reducing Alert Fatigue

┌─────────────────────────────────────────────────────────────────────┐
│                  ALERT OPTIMIZATION                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   COMMON PROBLEMS              SOLUTIONS                            │
│   ───────────────────────────────────────────────────────────────── │
│   Too many warnings            Raise thresholds                     │
│   Duplicate alerts             Group by resource/service            │
│   Noise from flapping          Use longer evaluation windows        │
│   Not actionable               Improve condition definition         │
│   Wrong people notified        Update action groups                 │
│   No runbooks                  Create and link runbooks             │
│                                                                     │
│   ───────────────────────────────────────────────────────────────── │
│                                                                     │
│   METRICS TO TRACK:                                                 │
│   • Alert volume over time                                          │
│   • % of actionable alerts                                          │
│   • Time to acknowledge                                             │
│   • Time to resolve                                                 │
│   • Alert storms                                                    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
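
The metrics listed above can be computed from exported alert history. The Python sketch below assumes each record carries fired, acknowledged, and resolved timestamps plus an actionable flag. That record shape, the function names, and the storm thresholds are illustrative assumptions.

```python
from datetime import datetime, timedelta
from statistics import mean


def alert_health(history):
    """Summarize alert records that have fired, acknowledged, resolved, and actionable keys."""
    if not history:
        return None
    return {
        "volume": len(history),
        "actionable_pct": 100 * sum(1 for a in history if a["actionable"]) / len(history),
        "mean_minutes_to_ack": mean(
            (a["acknowledged"] - a["fired"]).total_seconds() / 60 for a in history
        ),
        "mean_minutes_to_resolve": mean(
            (a["resolved"] - a["fired"]).total_seconds() / 60 for a in history
        ),
    }


def is_alert_storm(history, window=timedelta(minutes=10), limit=3):
    """Return True if more than `limit` alerts fired within any sliding window."""
    fired = sorted(a["fired"] for a in history)
    start = 0
    for end in range(len(fired)):
        # Shrink the window from the left until it spans at most `window`.
        while fired[end] - fired[start] > window:
            start += 1
        if end - start + 1 > limit:
            return True
    return False


def record(minute, ack_after, resolve_after, actionable):
    """Build a hypothetical history record relative to 09:00."""
    fired = datetime(2024, 1, 1, 9, 0) + timedelta(minutes=minute)
    return {
        "fired": fired,
        "acknowledged": fired + timedelta(minutes=ack_after),
        "resolved": fired + timedelta(minutes=resolve_after),
        "actionable": actionable,
    }


history = [
    record(0, 5, 30, True),
    record(2, 10, 20, False),
    record(4, 15, 40, True),
    record(30, 10, 30, False),
]
health = alert_health(history)
print(health, is_alert_storm(history, limit=2))
```

Tracking these numbers per alert rule, rather than only in aggregate, makes it easier to see which rules generate the noise.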

Alert Tuning Process

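using System.Linq;

// TuningRecommendation, TimeRange, and GetAlertHistoryAsync are application-defined and not shown.
public class AlertTuner
{
    public async Task<TuningRecommendation> AnalyzeAlertAsync(
        string alertName,
        TimeRange period)
    {
        // Materialize once so the history is not enumerated repeatedly
        var alerts = (await GetAlertHistoryAsync(alertName, period)).ToList();
        var recommendation = new TuningRecommendation();

        if (alerts.Count == 0)
        {
            // Nothing fired in the period, so a noise rate cannot be computed
            recommendation.Action = "No change";
            recommendation.Reason = "No alerts fired in the period";
            return recommendation;
        }

        var actionable = alerts.Count(a => a.RequiredAction);
        var noiseRate = (double)(alerts.Count - actionable) / alerts.Count;

        if (noiseRate > 0.5)
        {
            recommendation.Action = "Increase threshold or use longer window";
            recommendation.Reason = $"{noiseRate:P0} are noise";
        }
        else if (noiseRate > 0.2)
        {
            recommendation.Action = "Review condition - consider grouping";
            recommendation.Reason = $"{noiseRate:P0} need refinement";
        }
        else
        {
            recommendation.Action = "No change";
            recommendation.Reason = $"{noiseRate:P0} noise is within tolerance";
        }

        return recommendation;
    }
}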

Best Practices

Implementation Checklist

Practice                   Description
─────────────────────────────────────────────────────────
Define severity clearly    Know what each level means
Create runbooks            Document response procedures
Test alerts                Verify alerts fire correctly
Review regularly           Tune based on actual behavior
Use action groups          Group by team/response
Enable auto-remediation    Where safe and appropriate

Common Mistakes

┌─────────────────────────────────────────────────────────────────────┐
│                  ALERT MISTAKES                                     │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ✗ Alerting on everything                                          │
│   ✗ No severity levels                                              │
│   ✗ Ignoring alert fatigue                                          │
│   ✗ No runbooks                                                     │
│   ✗ Wrong notification recipients                                   │
│   ✗ Noisy conditions (too sensitive)                                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Azure Integration Hub - Architect Level Observability & Operations at Scale