Azure Integration Hub

01 Monitoring & Observability Overview

Modern cloud systems are too complex to debug with guesswork. Observability is the ability to understand a system's internal state from its external outputs — the three pillars being logs, metrics, and traces. Azure's monitoring stack maps these pillars to specific services that work together:

🔭

Application Insights

APM (Application Performance Management) — request tracking, dependency calls, exceptions, custom events, user analytics, availability tests.

📋

Log Analytics

Centralised log store and query engine. KQL-powered queries across logs from every Azure service, custom apps, VMs, and containers.

📈

Azure Monitor

The platform umbrella — collects metrics from every Azure resource, evaluates alert rules, triggers action groups, and powers workbooks.

🕸️

Distributed Tracing

End-to-end request tracking across microservices. OpenTelemetry W3C trace context propagation — see the full call chain from gateway to database.
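
To make W3C trace context propagation concrete, here is a minimal sketch using only the .NET base class library (`System.Diagnostics.Activity`), with no Azure packages. The source name `Demo.Gateway`, the operation names, and the sample `traceparent` value are assumptions made for illustration. It shows that a span started from an inbound `traceparent` header, and a child span started beneath it, share the same trace ID.

```csharp
using System;
using System.Diagnostics;

// Without a listener, ActivitySource.StartActivity returns null.
using var listener = new ActivityListener
{
    ShouldListenTo = source => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
};
ActivitySource.AddActivityListener(listener);

// Hypothetical source name for this example.
var gateway = new ActivitySource("Demo.Gateway");

// Simulate an inbound request carrying a W3C traceparent header:
// version - trace id (32 hex) - parent span id (16 hex) - flags
const string inboundTraceParent =
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
ActivityContext.TryParse(inboundTraceParent, null, out ActivityContext parent);

// Server span continues the caller's trace; the client span becomes its child
// because the server span is Activity.Current when it starts.
using Activity? server = gateway.StartActivity("HandleRequest", ActivityKind.Server, parent);
using Activity? client = gateway.StartActivity("CallDatabase", ActivityKind.Client);

Console.WriteLine($"inbound trace id : {parent.TraceId}");
Console.WriteLine($"server  trace id : {server?.TraceId}");
Console.WriteLine($"client  trace id : {client?.TraceId}");
// Activity.Id is already in traceparent format and is what gets sent downstream.
Console.WriteLine($"outbound header  : traceparent: {client?.Id}");
```

All three trace IDs printed are identical, which is what lets a backend stitch the gateway, service, and database spans into one call chain.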

The Three Pillars of Observability

Pillar	What It Answers	Azure Service	Data Type
Logs	What happened? What were the inputs/outputs?	Log Analytics / App Insights	Structured JSON events
Metrics	How is the system performing right now?	Azure Monitor Metrics	Numeric time-series
Traces	Why did this request take so long? What path did it take?	App Insights / Jaeger / Tempo	Distributed spans

How the Stack Fits Together

Your Application (.NET 8 / Functions / AKS pods)
        │
        │  emits via OpenTelemetry SDK or App Insights SDK
        ▼
┌────────────────────────────────────────────────────────┐
│              Azure Monitor (the platform)              │
│                                                        │
│  ┌─────────────────────┐   ┌────────────────────────┐  │
│  │ Application Insights│   │   Log Analytics        │  │
│  │  (APM + traces +    │   │   Workspace            │  │
│  │   exceptions +      │   │   (logs from ALL       │  │
│  │   custom events)    │   │    Azure resources)    │  │
│  └─────────────────────┘   └────────────────────────┘  │
│                                                        │
│  ┌─────────────────────┐   ┌────────────────────────┐  │
│  │  Metrics Store      │   │   Alerts Engine        │  │
│  │  (time-series,      │   │   (rules, action       │  │
│  │   auto-collected    │   │    groups, PagerDuty)  │  │
│  │   from all services)│   │                        │  │
│  └─────────────────────┘   └────────────────────────┘  │
└────────────────────────────────────────────────────────┘
        │
        ▼
Workbooks · Dashboards · Grafana · Power BI

02 Application Insights

Application Insights is Azure's Application Performance Management (APM) service. It automatically collects request telemetry, dependency calls (SQL, HTTP, Service Bus), exceptions, and performance counters, and it accepts custom events and metrics that you emit yourself, giving you end-to-end visibility into user interactions and background jobs.

Telemetry Type	Auto-Collected?	Description
Requests	✓ Yes	Every inbound HTTP request — duration, status, URL, method
Dependencies	✓ Yes	Outbound HTTP calls, SQL queries, Service Bus, Redis, Storage
Exceptions	✓ Yes	Unhandled exceptions with stack traces and request context
Traces	✓ Yes	ILogger output — severity, message, properties
Performance Counters	✓ Yes	CPU, memory, GC, thread count — from host OS
Custom Events	Manual	Business events — OrderPlaced, PaymentFailed, FeatureUsed
Custom Metrics	Manual	Numeric measurements — queue depth, cache hit rate
Page Views	JS SDK	Browser page load times, user sessions, demographics
Availability	Configured	Synthetic ping / multi-step tests from global locations

02a SDK Setup & Auto-Instrumentation

Getting Application Insights into your .NET 8 service takes only a few lines of configuration; the SDK auto-instruments HTTP requests, dependency calls, and exceptions out of the box. The modern approach uses the OpenTelemetry-based Azure Monitor exporter, which gives you vendor-neutral instrumentation with Azure-native export. Always use connection strings (not instrumentation keys) and store them in Key Vault; never hardcode secrets in your application code. With the classic SDK you can also enrich every telemetry item with custom properties using ITelemetryInitializer, which is invaluable for filtering by service version, environment, or correlation ID during incident investigation.

Connection-String Based Setup (.NET 8)

// NuGet: Microsoft.ApplicationInsights.AspNetCore
// NuGet: Azure.Monitor.OpenTelemetry.AspNetCore  ← modern OpenTelemetry-based

// ── Option A: Azure Monitor OpenTelemetry (recommended for new projects) ──
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .UseAzureMonitor(options =>
    {
        // Connection string from Key Vault / App Settings — never hardcode
        options.ConnectionString =
            builder.Configuration["ApplicationInsights:ConnectionString"];

        // Sampling — see Section 02c
        options.SamplingRatio = 0.1f; // 10% sampling in production
    });

// ── Option B: Classic App Insights SDK (use instead of Option A, not both) ──
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString =
        builder.Configuration["ApplicationInsights:ConnectionString"];
    options.EnableAdaptiveSampling = true;
    options.EnableHeartbeat        = true;
    options.EnableDebugLogger      = false; // Off in production
});

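// ── appsettings.json (local development) ──────────────────────────────
// {
//   "ApplicationInsights": {
//     "ConnectionString": "<supplied via user secrets or an environment variable>"
//   }
// }
//
// In App Service / Functions, set the app setting
//   ApplicationInsights__ConnectionString
// to a Key Vault reference, which the platform resolves for the app:
//   @Microsoft.KeyVault(VaultName=myVault;SecretName=ai-conn-string)
// Key Vault references are not resolved inside appsettings.json itself.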
//
// Connection string format:
// InstrumentationKey=<key>;IngestionEndpoint=https://eastus-8.in.applicationinsights.azure.com/;...

Worker Service / Background Jobs

// NuGet: Microsoft.ApplicationInsights.WorkerService
builder.Services.AddApplicationInsightsTelemetryWorkerService(options =>
{
    options.ConnectionString =
        builder.Configuration["ApplicationInsights:ConnectionString"];
});

// Azure Functions — auto-configured via host.json
// host.json:
// {
//   "logging": {
//     "applicationInsights": {
//       "samplingSettings": { "isEnabled": true, "maxTelemetryItemsPerSecond": 20 }
//     }
//   }
// }

Enriching All Telemetry with Custom Properties

// ITelemetryInitializer — runs on every telemetry item before it's sent
public class ServiceTelemetryInitializer : ITelemetryInitializer
{
    private readonly IHttpContextAccessor _http;

    public ServiceTelemetryInitializer(IHttpContextAccessor http) => _http = http;

    public void Initialize(ITelemetry telemetry)
    {
        // Tag every item with service metadata
        telemetry.Context.Cloud.RoleName     = "orders-api";
        telemetry.Context.Cloud.RoleInstance = Environment.MachineName;

        // Add custom global dimensions
        if (telemetry is ISupportProperties props)
        {
            props.Properties["ServiceVersion"] = "1.4.2";
            props.Properties["Environment"]    = Environment.GetEnvironmentVariable("ASPNETCORE_ENVIRONMENT") ?? "Unknown";

            // Propagate correlation IDs from inbound request headers
            var ctx = _http.HttpContext;
            if (ctx is not null)
            {
                var correlationId = ctx.Request.Headers["X-Correlation-Id"].FirstOrDefault();
                if (!string.IsNullOrEmpty(correlationId))
                    props.Properties["CorrelationId"] = correlationId;

                var userId = ctx.User.FindFirst("oid")?.Value;
                if (!string.IsNullOrEmpty(userId))
                    props.Properties["UserId"] = userId;
            }
        }
    }
}

// Register in DI
services.AddSingleton<ITelemetryInitializer, ServiceTelemetryInitializer>();

02b Custom Telemetry

While auto-instrumentation captures HTTP requests and dependencies, custom telemetry lets you track what matters to your business — events like "OrderPlaced" or "PaymentFailed", custom metrics like queue depth or cache hit rate, and manual dependency tracking for non-HTTP calls. This is where observability becomes truly powerful: you can correlate technical performance with business outcomes. Use TelemetryClient for custom events and metrics, ILogger with structured properties for contextual traces, and operation holders to group related telemetry items under a single operation ID for end-to-end correlation.

Tracking Custom Events, Metrics & Dependencies

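public class OrderService(
    TelemetryClient telemetry,
    ILogger<OrderService> logger,
    IInventoryClient inventoryClient) // IInventoryClient: assumed interface of the inventory gRPC client
{
    private readonly IInventoryClient _inventoryClient = inventoryClient;
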
    public async Task<Order> CreateOrderAsync(CreateOrderCommand cmd, CancellationToken ct)
    {
        // ── Custom Event — business milestone ─────────────────────────
        telemetry.TrackEvent("OrderCreated", new Dictionary<string, string>
        {
            ["CustomerId"] = cmd.CustomerId.ToString(),
            ["Region"]     = cmd.Region,
            ["ItemCount"]  = cmd.LineItems.Count.ToString(),
        },
        new Dictionary<string, double>
        {
            ["OrderAmount"] = (double)cmd.TotalAmount,
        });

        // ── Custom Metric — numeric measurement ───────────────────────
        telemetry.TrackMetric("Order.TotalAmount", (double)cmd.TotalAmount,
            new Dictionary<string, string> { ["Region"] = cmd.Region });

        // ── Custom Dependency — any external call ─────────────────────
        var startTime = DateTimeOffset.UtcNow;
        var timer     = Stopwatch.StartNew();
        bool success  = false;
        try
        {
            var result = await _inventoryClient.ReserveItemsAsync(cmd.LineItems, ct);
            success = result.IsSuccess;
            return result.Value;
        }
        catch (Exception ex)
        {
            // ── Exception with extra context ──────────────────────────
            telemetry.TrackException(ex, new Dictionary<string, string>
            {
                ["OrderId"]    = cmd.OrderId.ToString(),
                ["CustomerId"] = cmd.CustomerId.ToString(),
                ["Operation"]  = "CreateOrder",
            });
            throw;
        }
        finally
        {
            timer.Stop();
            telemetry.TrackDependency(
                dependencyTypeName: "gRPC",
                target:   "inventory-api",
                dependencyName: "InventoryClient.ReserveItems",
                data:     $"CustomerId={cmd.CustomerId}",
                startTime: startTime,
                duration:  timer.Elapsed,
                resultCode: success ? "200" : "500",
                success:   success);
        }
    }
}

// ── Structured logging — ILogger feeds into App Insights traces ───────
logger.LogInformation(
    "Order {OrderId} created for customer {CustomerId} — amount {Amount:C}",
    order.Id, order.CustomerId, order.TotalAmount);

// ── Using scopes for correlated log groups ────────────────────────────
using (logger.BeginScope(new Dictionary<string, object>
{
    ["OrderId"]    = order.Id,
    ["CustomerId"] = order.CustomerId,
    ["RequestId"]  = Activity.Current?.TraceId.ToString() ?? ""
}))
{
    logger.LogInformation("Starting payment processing");
    await ProcessPaymentAsync(order, ct);
    logger.LogInformation("Payment processed successfully");
}

Operation Tracking — Grouping Related Telemetry

// Use IOperationHolder to group related telemetry items
// All items inside the using block share the same operation ID
public async Task ProcessMessageAsync(ServiceBusReceivedMessage message)
{
    using var operation = telemetry.StartOperation<RequestTelemetry>(
        "ServiceBus.ProcessOrder");

    operation.Telemetry.Properties["MessageId"]  = message.MessageId;
    operation.Telemetry.Properties["QueueName"]  = "orders";

    try
    {
        var order = JsonSerializer.Deserialize<OrderCreatedEvent>(message.Body);
        await ProcessOrderInternalAsync(order!);

        operation.Telemetry.Success    = true;
        operation.Telemetry.ResponseCode = "200";
    }
    catch (Exception ex)
    {
        operation.Telemetry.Success    = false;
        operation.Telemetry.ResponseCode = "500";
        telemetry.TrackException(ex);
        throw;
    }
}

02c Sampling & Cost Control

Application Insights charges per GB ingested. For high-traffic services, sampling is essential — it reduces data volume while preserving statistical accuracy and keeping correlated telemetry together (all spans of one trace are either all sampled or all dropped).

Sampling Type	Where Configured	How It Works	Best For
Adaptive Sampling	SDK (default on)	Auto-adjusts rate to stay under target events/sec	Variable traffic — production default
Fixed-Rate Sampling	SDK config	Fixed % of operations sampled — predictable volume	Predictable billing, A/B comparison
Ingestion Sampling	App Insights portal	Drops data after arrival — no SDK change needed	Quick cost reduction without redeployment
OpenTelemetry Sampling	OTel SDK	Head-based or tail-based at trace level	Modern OTel pipelines

// Adaptive sampling: recommended for most services.
// To tune it, switch off the built-in default and register your own settings.
services.AddApplicationInsightsTelemetry(options =>
{
    options.EnableAdaptiveSampling = false; // replaced by the custom configuration below
});

services.Configure<TelemetryConfiguration>(config =>
{
    var chain = config.DefaultTelemetrySink.TelemetryProcessorChainBuilder;

    chain.UseAdaptiveSampling(
        new SamplingPercentageEstimatorSettings
        {
            MaxTelemetryItemsPerSecond = 5,     // Target: max 5 items/sec per instance
            MinSamplingPercentage      = 0.1,   // Never sample less than 0.1%
            MaxSamplingPercentage      = 100,   // Full sampling when traffic is low
            EvaluationInterval         = TimeSpan.FromSeconds(15),
            SamplingPercentageDecreaseTimeout = TimeSpan.FromMinutes(2),
            SamplingPercentageIncreaseTimeout = TimeSpan.FromMinutes(15),
        },
        null); // optional callback invoked after each rate re-evaluation

    chain.Build();
});

// ── Fixed-rate sampling: predictable volume ───────────────────────────
// Disable adaptive sampling (EnableAdaptiveSampling = false) when using this.
services.Configure<TelemetryConfiguration>(config =>
{
    config.TelemetryProcessorChainBuilder
        .UseSampling(samplingPercentage: 10) // Sample 10% of operations
        .Build();
});

// ── Drop noisy telemetry (health checks) before sampling ──────────────
public class ExcludeHealthCheckFilter : ITelemetryProcessor
{
    private readonly ITelemetryProcessor _next;

    public ExcludeHealthCheckFilter(ITelemetryProcessor next) => _next = next;

    public void Process(ITelemetry item)
    {
        // Don't send health check telemetry to App Insights at all
        if (item is RequestTelemetry req &&
            req.Url?.AbsolutePath.StartsWith("/health") == true)
            return;

        _next.Process(item);
    }
}

services.Configure<TelemetryConfiguration>(config =>
{
    config.TelemetryProcessorChainBuilder
        .Use(next => new ExcludeHealthCheckFilter(next))
        .UseSampling(10)
        .Build();
});

💰

Sampling preserves trace integrity

When a request is sampled, App Insights samples all dependent telemetry with the same operation ID, so the complete trace stays together or is dropped together. You never see a request without its dependencies, or a dependency without its parent request.
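
One way to see why a trace is kept or dropped as a whole: if the sampling decision is a pure function of the operation ID, every item carrying that ID gets the same answer. The sketch below assumes an FNV-1a hash and hypothetical operation IDs purely for illustration; it is not the hashing scheme the SDK actually uses.

```csharp
using System;
using System.Linq;

// Maps an operation ID to a stable score in [0, 100].
// FNV-1a is used because it is deterministic across processes,
// unlike string.GetHashCode(), which is randomized per process in modern .NET.
static double SamplingScore(string operationId)
{
    uint hash = 2166136261;
    foreach (char c in operationId)
    {
        hash ^= c;
        hash *= 16777619; // wraps on overflow (unchecked by default)
    }
    return hash / (double)uint.MaxValue * 100.0;
}

// An item is kept when its operation's score falls below the sampling percentage.
static bool IsSampledIn(string operationId, double samplingPercentage) =>
    SamplingScore(operationId) < samplingPercentage;

// Hypothetical telemetry items: several item kinds per operation.
var items = new (string OperationId, string Kind)[]
{
    ("op-1", "request"), ("op-1", "dependency"), ("op-1", "exception"),
    ("op-2", "request"), ("op-2", "dependency"),
    ("op-3", "request"), ("op-3", "dependency"),
};

foreach (var trace in items.GroupBy(i => i.OperationId))
{
    int kept = trace.Count(i => IsSampledIn(i.OperationId, samplingPercentage: 50));
    // kept is either 0 or the full group size, never a partial trace.
    Console.WriteLine($"{trace.Key}: kept {kept} of {trace.Count()} items");
}
```

Because the decision depends only on the operation ID, independent services that apply the same function and percentage reach the same verdict without coordinating.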

02d Availability Tests

Availability tests run synthetic requests to your endpoints from Azure global PoP locations on a schedule — detecting outages from the user's perspective before your monitoring picks them up internally.

Test Type	Description	Best For
Standard (URL ping)	Simple HTTP GET/POST to one URL — checks status code and response time	API health endpoints, uptime SLA monitoring
Multi-step (TrackAvailability)	Custom code simulating a user journey — login, search, checkout	Critical user flows, end-to-end smoke tests
Custom TrackAvailability	Emit availability telemetry from your own infrastructure	Private endpoints not reachable from Azure PoPs

// Custom availability test via Azure Function (for private endpoints)
[FunctionName("AvailabilityTest")]
public async Task Run([TimerTrigger("0 */5 * * * *")] TimerInfo timer)
{
    var testName    = "Orders API — Create Order Flow";
    var runLocation = "East US";
    var startTime   = DateTimeOffset.UtcNow;
    var timer2      = Stopwatch.StartNew();
    bool success    = false;
    string message  = "";

    try
    {
        // Step 1: Authenticate
        var token = await GetTestTokenAsync();

        // Step 2: Create a test order
        _httpClient.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", token);

        var response = await _httpClient.PostAsJsonAsync(
            "/api/v1/orders",
            new { customerId = TestCustomerId, lineItems = TestLineItems });

        success = response.IsSuccessStatusCode;
        message = $"Status: {(int)response.StatusCode}";

        // Step 3: Verify the order exists
        if (success)
        {
            var order = await response.Content.ReadFromJsonAsync<OrderResponse>();
            var getResponse = await _httpClient.GetAsync($"/api/v1/orders/{order!.OrderId}");
            success = getResponse.IsSuccessStatusCode;
            message += $" | GET: {(int)getResponse.StatusCode}";
        }
    }
    catch (Exception ex)
    {
        success = false;
        message = ex.Message;
    }
    finally
    {
        timer2.Stop();
        _telemetry.TrackAvailability(
            name:      testName,
            timeStamp: startTime,
            duration:  timer2.Elapsed,
            runLocation: runLocation,
            success:   success,
            message:   message);
    }
}

02e Application Map & Live Metrics

Application Map automatically renders your service topology: nodes for each component, edges for dependency calls, and failure rates and latency on each edge. Live Metrics streams telemetry in near real time (about one second of latency), which makes it essential during deployments and incident response.

// ── App Insights KQL — Failure rate by operation in last 1h ─────────
requests
| where timestamp > ago(1h)
| summarize
    Total    = count(),
    Failed   = countif(success == false),
    P95_ms   = percentile(duration, 95),
    P99_ms   = percentile(duration, 99)
  by name
| extend FailureRate = round(100.0 * Failed / Total, 2)
| where Total > 10
| order by FailureRate desc

// ── Slow dependency calls (SQL > 1s) ──────────────────────────────────
dependencies
| where timestamp > ago(1h)
| where type == "SQL"
| where duration > 1000
| project timestamp, name, data, duration, success, resultCode
| order by duration desc
| take 50

// ── Top exceptions in last 24h ────────────────────────────────────────
exceptions
| where timestamp > ago(24h)
| summarize Count = count() by type, outerMessage
| order by Count desc
| take 20

// ── User journey — trace one operation end-to-end ─────────────────────
let opId = "abc123def456";
union requests, dependencies, exceptions, traces
| where operation_Id == opId
| project timestamp, itemType, name, duration, success, message, type
| order by timestamp asc

03 Log Analytics

Log Analytics Workspace is Azure's centralised log aggregation and query engine. Every Azure service can ship its diagnostic logs here. Your applications send structured logs via the App Insights SDK or the OTel OTLP exporter. Queries are written in KQL (Kusto Query Language) — a powerful, expressive SQL-like language purpose-built for log analytics.

Log Source	How to Connect	Key Tables
Azure App Service	Diagnostic Settings → Log Analytics	AppServiceHTTPLogs, AppServiceConsoleLogs
Azure Functions	Diagnostic Settings + AI SDK	FunctionAppLogs, requests, traces
Azure Kubernetes Service	Container Insights add-on	ContainerLog, KubePodInventory, KubeEvents
Azure API Management	Diagnostic Settings	ApiManagementGatewayLogs
Azure Service Bus	Diagnostic Settings	AzureDiagnostics (ResourceType=NAMESPACES)
Azure Key Vault	Diagnostic Settings	AzureDiagnostics (ResourceType=VAULTS)
Application Insights	Linked workspace (auto)	requests, dependencies, exceptions, traces
Custom App	OTel OTLP exporter / DCR API	Custom table or AppTraces
Azure Activity Log	Export to workspace	AzureActivity
VM / Arc	Azure Monitor Agent	Syslog, SecurityEvent, Event

03a Workspace Design

How you structure your Log Analytics workspaces determines your query capabilities, access control boundaries, and cost efficiency. The recommended pattern for most organizations is a centralized workspace per environment with table-level RBAC — this enables cross-service correlation queries while maintaining security isolation. Always link Application Insights to a workspace so you can join APM data with platform logs in a single KQL query. Consider commitment tiers once you exceed 100 GB/day ingestion, as they offer up to 30% savings over pay-as-you-go pricing.

Decision	Recommendation	Reason
Workspaces per environment	One per environment (dev/staging/prod)	Isolation — no dev noise in prod queries
Workspaces per region	One per region if data residency required	Compliance — data stays in region
Centralised vs per-team	Central workspace + table-level RBAC	Cost efficiency + cross-service correlation
Retention period	30 days interactive by default (extendable to 730 days); long-term retention up to 12 years	Balance cost vs forensics requirement
Commitment tier	Use commitment tier at >100 GB/day	Up to 30% discount over pay-as-you-go
Linked App Insights	Always link AI to a workspace	Unified queries across AI + platform logs

# Create Log Analytics Workspace
az monitor log-analytics workspace create \
  --resource-group myRG \
  --workspace-name myPlatformWorkspace \
  --location eastus \
  --retention-time 90 \
  --sku PerGB2018

# Get workspace ID (for App Insights linking)
WORKSPACE_ID=$(az monitor log-analytics workspace show \
  --resource-group myRG \
  --workspace-name myPlatformWorkspace \
  --query id --output tsv)

# Create Application Insights linked to the workspace
az monitor app-insights component create \
  --app myOrdersApi \
  --resource-group myRG \
  --location eastus \
  --workspace "$WORKSPACE_ID" \
  --kind web

# Enable diagnostic settings for Service Bus → Workspace
az monitor diagnostic-settings create \
  --resource "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ServiceBus/namespaces/<ns>" \
  --name "sb-diagnostics" \
  --workspace "$WORKSPACE_ID" \
  --logs '[{"category":"OperationalLogs","enabled":true},{"category":"VNetAndIPFilteringLogs","enabled":true}]' \
  --metrics '[{"category":"AllMetrics","enabled":true}]'

03b KQL — Core Queries

KQL (Kusto Query Language) is the query language for Log Analytics and Application Insights — it is purpose-built for exploring large volumes of telemetry data with sub-second response times. Unlike SQL, KQL uses a pipe-based syntax where each operator transforms the result set flowing through it, making complex queries readable and composable. Always start with a time filter to limit the data scanned, use has instead of contains for better performance on word-boundary matches, and leverage summarize with bin() for time-series aggregations that render beautifully as charts.

KQL Cheat Sheet — Essential Operators

// ── Filtering ─────────────────────────────────────────────────────────
TableName
| where TimeGenerated > ago(1h)                   // Time filter (always first!)
| where Level == "Error"                           // Exact match
| where Message contains "timeout"                // Case-insensitive substring
| where Message has "OrderId"                      // Faster than contains for words
| where StatusCode between (500 .. 599)            // Numeric range
| where isnotempty(CorrelationId)                  // Not null/empty
| where tostring(Properties.region) in ("EU", "US") // Value in list (cast dynamic first)

// ── Projection ────────────────────────────────────────────────────────
| project TimeGenerated, Level, Message, Properties  // Select columns
| project-away TenantId, SubscriptionId              // Drop columns
| extend DurationSec = Duration / 1000.0             // Add computed column
| parse Message with "OrderId=" OrderId " amount=" Amount:double " region=" Region

// ── Aggregation ───────────────────────────────────────────────────────
| summarize
    Count       = count(),
    ErrorCount  = countif(Level == "Error"),
    AvgDuration = avg(Duration),
    P95Duration = percentile(Duration, 95),
    MaxDuration = max(Duration)
  by bin(TimeGenerated, 5m), ServiceName

// ── Sorting & Limiting ────────────────────────────────────────────────
| order by TimeGenerated desc
| top 100 by Duration desc        // Top N by a column

// ── Joining tables ────────────────────────────────────────────────────
requests
| join kind=leftouter (
    exceptions
    | project operation_Id, exceptionType = type, exceptionMsg = outerMessage
) on operation_Id

// ── String operations ─────────────────────────────────────────────────
| extend OrderId = extract("OrderId=([a-f0-9-]+)", 1, Message)
| extend Domain  = split(Email, "@")[1]
| where ServiceName startswith "order"
| where Url matches regex @"/api/v[0-9]+/orders/[a-f0-9-]+"

// ── Time operations ───────────────────────────────────────────────────
| extend HourOfDay  = hourofday(TimeGenerated)
| extend DayOfWeek  = dayofweek(TimeGenerated)
| where TimeGenerated between (datetime(2026-05-01) .. datetime(2026-05-08))
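
To make the `summarize ... by bin(TimeGenerated, 5m)` pattern concrete, the following sketch reproduces it with LINQ over a handful of in-memory rows. The rows and the `Bin` helper are assumptions for illustration only; the point is that `bin()` floors each timestamp to the start of its bucket and `summarize` then aggregates per bucket.

```csharp
using System;
using System.Linq;

// bin(value, size): round the timestamp down to a multiple of the bucket size.
static DateTime Bin(DateTime value, TimeSpan size) =>
    new DateTime(value.Ticks - value.Ticks % size.Ticks, value.Kind);

// Hypothetical log rows.
var t0 = new DateTime(2026, 5, 1, 12, 0, 0, DateTimeKind.Utc);
var rows = new (DateTime TimeGenerated, string Level)[]
{
    (t0.AddMinutes(0.5), "Info"),
    (t0.AddMinutes(3),   "Error"),
    (t0.AddMinutes(4.9), "Info"),
    (t0.AddMinutes(5),   "Error"),
    (t0.AddMinutes(9),   "Error"),
};

// Equivalent of:
//   | summarize Count = count(), ErrorCount = countif(Level == "Error")
//     by bin(TimeGenerated, 5m)
var summary = rows
    .GroupBy(r => Bin(r.TimeGenerated, TimeSpan.FromMinutes(5)))
    .OrderBy(g => g.Key)
    .Select(g => new
    {
        Bucket     = g.Key,
        Count      = g.Count(),
        ErrorCount = g.Count(r => r.Level == "Error"),
    })
    .ToList();

foreach (var s in summary)
    Console.WriteLine($"{s.Bucket:HH:mm}  count={s.Count}  errors={s.ErrorCount}");
// 12:00  count=3  errors=1
// 12:05  count=2  errors=2
```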

Essential Production Queries

// ── Error rate trend (5-min buckets) ─────────────────────────────────
requests
| where timestamp > ago(3h)
| summarize
    Total  = count(),
    Failed = countif(success == false)
  by bin(timestamp, 5m), cloud_RoleName
| extend ErrorRate = round(100.0 * Failed / Total, 2)
| render timechart

// ── P50/P95/P99 latency by endpoint ──────────────────────────────────
requests
| where timestamp > ago(1h)
| summarize
    P50 = percentile(duration, 50),
    P95 = percentile(duration, 95),
    P99 = percentile(duration, 99),
    Count = count()
  by name
| where Count > 100
| order by P99 desc

// ── Dependency failures — which external calls are breaking? ──────────
dependencies
| where timestamp > ago(1h)
| where success == false
| summarize
    FailureCount = count(),
    AvgDuration  = avg(duration),
    Targets      = make_set(target)
  by type, name
| order by FailureCount desc

// ── Exceptions — top 10 with sample messages ─────────────────────────
exceptions
| where timestamp > ago(24h)
| summarize
    Count   = count(),
    Sample  = take_any(outerMessage),
    LastSeen = max(timestamp)
  by type
| top 10 by Count desc

// ── Slow SQL queries ──────────────────────────────────────────────────
dependencies
| where timestamp > ago(1h)
| where type == "SQL"
| where duration > 500
| project timestamp, name, data, duration, resultCode
| order by duration desc
| take 20

// ── Container restart loop detection ─────────────────────────────────
ContainerLog
| where TimeGenerated > ago(1h)
| where LogEntry contains "OOMKilled" or LogEntry contains "CrashLoopBackOff"
| summarize Restarts = count() by ContainerID, _ResourceId
| where Restarts > 3
| order by Restarts desc
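
The latency query above reports P50/P95/P99 rather than an average. As a rough illustration of why, this sketch computes a nearest-rank percentile over a small set of made-up durations. KQL's `percentile()` uses an approximate algorithm on large data sets, so treat this as a conceptual model rather than a re-implementation; the sample values are hypothetical and the helper assumes a non-empty input.

```csharp
using System;
using System.Linq;

// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. Assumes samples is non-empty.
static double Percentile(double[] samples, double p)
{
    double[] sorted = samples.OrderBy(x => x).ToArray();
    int rank = (int)Math.Ceiling(p / 100.0 * sorted.Length);
    return sorted[Math.Clamp(rank, 1, sorted.Length) - 1];
}

// Hypothetical request durations in milliseconds, with one slow outlier.
double[] durations = { 120, 80, 95, 2000, 110, 105, 90, 100, 130, 85 };

Console.WriteLine($"avg = {durations.Average()}");       // 291.5
Console.WriteLine($"P50 = {Percentile(durations, 50)}"); // 100
Console.WriteLine($"P95 = {Percentile(durations, 95)}"); // 2000
```

The single outlier drags the average to nearly three times the typical request, while P50 stays at the typical value and P95 exposes the slow tail directly.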

03c KQL — Advanced Patterns

Advanced KQL patterns unlock powerful capabilities like anomaly detection, cross-workspace correlation, funnel analysis, and dynamic JSON parsing — these are the queries that power production alert rules and executive dashboards. Error spike detection compares current error counts against a rolling baseline to catch regressions without hardcoded thresholds. Cross-workspace queries let you correlate security events with application failures, while funnel analysis tracks user drop-off through multi-step business flows. Master these patterns and you can answer almost any operational or business question from your telemetry data alone.

// ── Alerting query: error spike detection (> 2x baseline) ───────────
// KQL has no "cross join" operator; toscalar() brings the baseline in as a constant.
let baselineErrors = toscalar(
    requests
    | where timestamp between (ago(2h) .. ago(1h))
    | summarize countif(success == false));
requests
| where timestamp > ago(1h)
| summarize CurrentErrors = countif(success == false)
| extend BaselineErrors = baselineErrors
| where CurrentErrors > BaselineErrors * 2 and CurrentErrors > 10
| project CurrentErrors, BaselineErrors,
          Increase = round(100.0 * (CurrentErrors - BaselineErrors) / BaselineErrors, 1)

// ── User impact analysis — how many users hit errors? ─────────────────
requests
| where timestamp > ago(1h)
| where success == false
| summarize
    FailedRequests = count(),
    AffectedUsers  = dcount(user_Id),
    AffectedOps    = dcount(operation_Id)
  by name, resultCode
| order by AffectedUsers desc

// ── Funnel analysis — order creation drop-off ─────────────────────────
let s1 = toscalar(customEvents | where name == "CartViewed"      | summarize dcount(user_Id));
let s2 = toscalar(customEvents | where name == "CheckoutStarted" | summarize dcount(user_Id));
let s3 = toscalar(customEvents | where name == "OrderSubmitted"  | summarize dcount(user_Id));
let s4 = toscalar(customEvents | where name == "OrderConfirmed"  | summarize dcount(user_Id));
print
    CartViewed      = s1,
    CheckoutStarted = s2,
    OrderSubmitted  = s3,
    OrderConfirmed  = s4
| extend
    CheckoutConversion = round(100.0 * CheckoutStarted / CartViewed, 1),
    SubmitConversion   = round(100.0 * OrderSubmitted / CheckoutStarted, 1),
    FinalConversion    = round(100.0 * OrderConfirmed / CartViewed, 1)

// ── Cross-workspace query — correlate App Insights + Security logs ────
workspace("SecurityWorkspace").SecurityEvent
| where TimeGenerated > ago(1h)
| where EventID == 4625 // Failed login
| join kind=inner (
    app("OrdersAppInsights").requests
    | where success == false
    | project operation_Id, clientIP = client_IP
) on $left.IpAddress == $right.clientIP
| project TimeGenerated, IpAddress, Account, operation_Id

// ── Dynamic columns from JSON properties ──────────────────────────────
customEvents
| where name == "OrderCreated"
| extend
    CustomerId = tostring(customDimensions["CustomerId"]),
    Amount     = todouble(customMeasurements["OrderAmount"]), // sent as a measurement, not a property
    Region     = tostring(customDimensions["Region"])
| summarize Revenue = sum(Amount), Orders = count() by Region, bin(timestamp, 1h)
| render timechart
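
The error-spike rule described at the start of this section combines a relative test (more than twice the baseline) with an absolute floor (more than 10 errors). The sketch below restates that condition as a small function so the edge cases are easy to check; the function name and sample numbers are assumptions for illustration.

```csharp
using System;

// Mirrors the alert condition: current > 2 x baseline AND current > 10.
// The percentage increase is undefined when the baseline is zero, so it is nullable.
static (bool Fire, double? IncreasePercent) EvaluateSpike(long currentErrors, long baselineErrors)
{
    bool fire = currentErrors > baselineErrors * 2 && currentErrors > 10;
    double? increase = baselineErrors == 0
        ? null
        : Math.Round(100.0 * (currentErrors - baselineErrors) / baselineErrors, 1);
    return (fire, increase);
}

// Fires: 45 > 24 and 45 > 10.
Console.WriteLine(EvaluateSpike(currentErrors: 45, baselineErrors: 12));
// Does not fire: 8 is 4x the baseline but below the absolute floor of 10.
Console.WriteLine(EvaluateSpike(currentErrors: 8, baselineErrors: 2));
// Does not fire: 20 is not more than twice 15.
Console.WriteLine(EvaluateSpike(currentErrors: 20, baselineErrors: 15));
// Fires with no meaningful percentage: the baseline hour had zero errors.
Console.WriteLine(EvaluateSpike(currentErrors: 11, baselineErrors: 0));
```

The absolute floor is what keeps a quiet service from alerting on a jump from 2 errors to 8; the zero-baseline case is worth handling explicitly in whatever consumes the alert payload.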

03d Key Log Tables Reference

Every Azure service and Application Insights telemetry type maps to a specific table in Log Analytics — knowing which table to query is half the battle during incident response. The requests and dependencies tables are your go-to for API performance, while AzureDiagnostics is the catch-all for platform resource logs filtered by ResourceType. Bookmark this reference so you can jump straight to the right table when debugging production issues instead of guessing table names under pressure.

Table	Source	Key Columns
requests	App Insights	name, duration, success, resultCode, url, operation_Id, cloud_RoleName
dependencies	App Insights	name, type, target, data, duration, success, resultCode, operation_Id
exceptions	App Insights	type, outerMessage, innermostMessage, stack, operation_Id, cloud_RoleName
traces	App Insights	message, severityLevel, customDimensions, operation_Id, cloud_RoleName
customEvents	App Insights	name, customDimensions, customMeasurements, operation_Id, user_Id
customMetrics	App Insights	name, value, valueCount, valueSum, valueMin, valueMax
availabilityResults	App Insights	name, success, duration, location, message, runLocation
AzureDiagnostics	Azure Resources	ResourceType, OperationName, ResultType, Level, CallerIpAddress
AzureActivity	Azure RBAC / Control	OperationName, Caller, ResourceGroup, ActivityStatus, Level
ContainerLog	AKS / Container Insights	ContainerID, LogEntry, LogEntrySource, _ResourceId
KubePodInventory	AKS	PodName, Namespace, PodStatus, ContainerStatusReason, Node
KubeEvents	AKS	Name, Namespace, Reason, Message, KubeEventType
AppServiceHTTPLogs	App Service	CsMethod, CsUriStem, ScStatus, TimeTaken, CIp
FunctionAppLogs	Azure Functions	HostInstanceId, Message, ExceptionMessage, FunctionName, Level
ApiManagementGatewayLogs	APIM	ApiId, OperationId, ResponseCode, TotalTime, ClientIp
SigninLogs	Entra ID	UserPrincipalName, IPAddress, ResultType, ConditionalAccessStatus

04Azure Monitor

Azure Monitor is the platform-level observability service — it automatically collects metrics from every Azure resource (no agent needed), evaluates alert rules, triggers action groups, and aggregates data into workbooks and dashboards. It is the umbrella that App Insights and Log Analytics sit within.

Azure Monitor Feature	Description
Platform Metrics	Auto-collected numeric metrics from every Azure resource — CPU, requests, queue depth, etc.
Custom Metrics	Emit your own time-series from apps via OTel or App Insights SDK
Alert Rules	Evaluate metric/log conditions and fire on threshold breach
Action Groups	Notifications and automation triggered by alerts — email, SMS, webhook, PagerDuty, Logic App
Workbooks	Interactive parameterised reports mixing KQL, metrics, and markdown
Dashboards	Pinnable metric charts and tiles for NOC-style displays
Autoscale	Scale Azure resources (App Service, VMSS) based on metric thresholds
Change Analysis	Track infrastructure changes correlated with incidents
Service Health	Azure platform incidents affecting your specific resources

04aMetrics & Dimensions

Azure Monitor automatically collects platform metrics from every resource (CPU, memory, request counts, queue depths) with no agent or SDK required. These numeric time-series are retained for 93 days and are typically sampled at one-minute granularity, making them well suited to near-real-time dashboards and alert rules. For application-specific measurements, you can emit custom metrics via the OpenTelemetry Meter API using counters, histograms, and gauges, with dimensional tags that enable filtering and grouping in metrics explorer.
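Each distinct combination of tag values becomes its own time series, so the number of dimensions and their cardinality drive both cost and query speed. The sketch below is a plain-Python illustration of that idea, not the Azure Monitor or OpenTelemetry API. The tag names mirror the region/channel/priority example used in this section, and the sample values are made up.

```python
from collections import Counter


def record(series: Counter, value: int, **tags: str) -> None:
    """Add `value` to the time series identified by the full tag set."""
    key = tuple(sorted(tags.items()))
    series[key] += value


def sum_by(series: Counter, dimension: str) -> dict[str, int]:
    """Collapse all series onto one dimension, like a 'split by' in a metrics chart."""
    totals: dict[str, int] = {}
    for key, value in series.items():
        label = dict(key).get(dimension, "<none>")
        totals[label] = totals.get(label, 0) + value
    return totals


orders_created: Counter = Counter()
record(orders_created, 1, region="eu-west", channel="web", priority="normal")
record(orders_created, 1, region="eu-west", channel="mobile", priority="high")
record(orders_created, 1, region="us-east", channel="web", priority="normal")
record(orders_created, 1, region="eu-west", channel="web", priority="normal")

print(len(orders_created))               # distinct time series created so far
print(sum_by(orders_created, "region"))  # totals grouped by one dimension
```

Four increments produce three series here because two of them share the same tag set. Adding a high-cardinality tag such as a customer ID would create one series per customer, which is why identifiers usually belong in logs or traces rather than metric dimensions.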

Key Metrics Per Service

Service	Critical Metrics to Monitor
App Service / Functions	CpuPercentage, MemoryWorkingSet, HttpResponseTime, HttpServerErrors, Requests
Azure SQL / SQL MI	cpu_percent, dtu_consumption_percent, deadlock, connection_failed, storage_percent
Azure Service Bus	ActiveMessages, DeadletteredMessages, IncomingMessages, ThrottledRequests, ServerErrors
Azure Event Hubs	IncomingMessages, OutgoingMessages, ThrottledRequests, IncomingBytes, CapturedMessages
Azure Event Grid	PublishSuccessCount, DeliverySuccessCount, DeadLetteredCount, DroppedEventCount
Azure Key Vault	ServiceApiHit, ServiceApiLatency, ServiceApiResult (failures)
Azure API Management	TotalRequests, SuccessfulRequests, FailedRequests, Duration, Capacity
Azure Container Apps	Replicas, CpuUsage, MemoryUsage, RequestCount, ResponseTime
AKS	node_cpu_usage_percentage, node_memory_working_set_percentage, kube_pod_status_phase
Azure Cosmos DB	TotalRequests, TotalRequestUnits, ServerSideLatency, NormalizedRUConsumption

Custom Metrics via OpenTelemetry (.NET 8)

using System.Diagnostics;            // Stopwatch
using System.Diagnostics.Metrics;    // Meter, Counter, Histogram

// Define meters and instruments (static — created once)
public static class OrdersMetrics
{
    private static readonly Meter Meter = new("Orders.Api", "1.0.0");

    // Counter — monotonically increasing
    public static readonly Counter<long> OrdersCreated =
        Meter.CreateCounter<long>(
            name: "orders.created.total",
            unit: "orders",
            description: "Total number of orders created");

    // Histogram — distribution of values (latency, sizes)
    public static readonly Histogram<double> OrderProcessingDuration =
        Meter.CreateHistogram<double>(
            name: "orders.processing.duration",
            unit: "ms",
            description: "Time to process an order end-to-end");

    // ObservableGauge — current value from a callback
    public static readonly ObservableGauge<int> PendingOrders =
        Meter.CreateObservableGauge<int>(
            name: "orders.pending.count",
            observeValue: () => OrderRepository.GetPendingCount(),
            unit: "orders",
            description: "Current number of pending orders");

    // UpDownCounter — can increase and decrease
    public static readonly UpDownCounter<int> ActiveConnections =
        Meter.CreateUpDownCounter<int>(
            name: "orders.active.connections",
            unit: "connections");
}

// Usage in command handler
public async Task<Result<OrderResponse>> Handle(CreateOrderCommand cmd, CancellationToken ct)
{
    var sw = Stopwatch.StartNew();
    try
    {
        var order = Order.Create(cmd.CustomerId, cmd.LineItems);
        await _repository.SaveAsync(order, ct);

        // Record with dimensions (tags)
        OrdersMetrics.OrdersCreated.Add(1,
            new KeyValuePair<string, object?>("region",   cmd.Region),
            new KeyValuePair<string, object?>("channel",  cmd.Channel),
            new KeyValuePair<string, object?>("priority", cmd.IsPriority ? "high" : "normal"));

        return Result.Success(OrderResponse.From(order));
    }
    finally
    {
        sw.Stop();
        OrdersMetrics.OrderProcessingDuration.Record(sw.Elapsed.TotalMilliseconds,
            new KeyValuePair<string, object?>("region", cmd.Region));
    }
}

// Register meters with OTel
builder.Services.AddOpenTelemetry()
    .WithMetrics(m => m
        .AddMeter("Orders.Api")
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()           // GC, thread pool, heap
        .AddProcessInstrumentation()           // CPU, memory
        .AddOtlpExporter()                     // → OTLP endpoint, e.g. an OpenTelemetry Collector
        .AddPrometheusExporter());             // Prometheus scraping; also call app.MapPrometheusScrapingEndpoint()

04bAlert Rules & Action Groups

Alert rules are the automated sentinels of your production environment — they continuously evaluate conditions against your metrics and logs, firing notifications when thresholds are breached. Action groups define who gets notified and how: email, SMS, webhook to PagerDuty, or even triggering a Logic App for automated remediation. The key to effective alerting is combining metric alerts for real-time threshold detection with log search alerts for complex pattern matching, while keeping severity levels aligned with actual user impact to prevent alert fatigue.

Alert Type	Condition Evaluated	Best For
Metric Alert	Numeric metric crosses threshold	CPU > 80%, error rate > 5%, queue depth > 1000
Log Search Alert	KQL query returns rows	DLQ messages, specific error patterns, security events
Activity Log Alert	Azure control-plane event occurs	Resource deleted, role assigned, policy violated
Smart Detection	AI-detected anomalies in App Insights	Sudden increase in exceptions, degraded response time
Resource Health Alert	Azure resource enters unhealthy state	VM unavailable, SQL inaccessible, App Service down

Create Alert Rule via CLI

# ── Step 1: Create Action Group (who gets notified) ─────────────────
az monitor action-group create \
  --resource-group myRG \
  --name platform-oncall \
  --short-name oncall \
  --action email oncall-email ops-team@contoso.com \
  --action webhook pagerduty https://events.pagerduty.com/integration/<key>/enqueue

# ── Step 2: Metric Alert — Service Bus DLQ > 0 ──────────────────────
az monitor metrics alert create \
  --name "ServiceBus-DLQ-Alert" \
  --resource-group myRG \
  --scopes "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ServiceBus/namespaces/<ns>" \
  --condition "avg DeadletteredMessages > 0" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 2 \
  --action "/subscriptions/<sub>/resourceGroups/myRG/providers/microsoft.insights/actionGroups/platform-oncall" \
  --description "Messages in Service Bus dead-letter queue"

# ── Step 3: Log Alert — 5xx errors spike ─────────────────────────────
# --condition refers to the query by a placeholder name defined in --condition-query.
# The time range comes from --window-size, so the query needs no ago() filter.
az monitor scheduled-query create \
  --name "High-Error-Rate-Alert" \
  --resource-group myRG \
  --scopes "<app-insights-resource-id>" \
  --condition "count 'FailRateQuery' > 0" \
  --condition-query FailRateQuery="requests | summarize FailRate = countif(success == false) * 100.0 / count() | where FailRate > 5" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 1 \
  --action-groups "/subscriptions/<sub>/resourceGroups/myRG/providers/microsoft.insights/actionGroups/platform-oncall"

Production Alert Rules Starter Pack

Alert Name	Metric / Query	Threshold	Severity
High Error Rate	requests — failure rate	> 5% over 5 min	Sev 1
P99 Latency Spike	requests — P99 duration	> 3000 ms	Sev 2
DLQ Messages	DeadletteredMessages	> 0	Sev 2
CPU Sustained High	CpuPercentage	> 85% for 15 min	Sev 2
Memory Near Limit	MemoryWorkingSet	> 90% of limit	Sev 2
Availability Failed	availabilityResults — success==false	Any failure	Sev 1
Key Vault Unauthorized	AzureDiagnostics — ResultType!=Success	Any	Sev 1
Pod CrashLoopBackOff	KubeEvents — Reason==BackOff	Any	Sev 2
Storage Near Capacity	UsedCapacity	> 80% of limit	Sev 3
Exception Spike	exceptions — count vs 1h baseline	> 2x baseline	Sev 2

04cWorkbooks & Dashboards

Workbooks are interactive, parameterised reports that combine KQL queries, Azure Metrics, markdown text, and ARM data into a single document. They are ideal for SLO reports, incident postmortems, capacity planning, and team-facing health dashboards.

// Workbook ARM template structure (simplified)
{
  "type": "microsoft.insights/workbooks",
  "properties": {
    "displayName": "Orders API — SLO Dashboard",
    "serializedData": {
      "version": "Notebook/1.0",
      "items": [
        {
          "type": 9,          // Parameters
          "content": {
            "parameters": [
              {
                "id": "timeRange",
                "version": "KqlParameterItem/1.0",
                "name": "TimeRange",
                "type": 4,
                "value": { "durationMs": 3600000 }
              },
              {
                "id": "service",
                "name": "ServiceName",
                "type": 2,             // Drop-down
                "query": "requests | summarize by cloud_RoleName"
              }
            ]
          }
        },
        {
          "type": 3,          // Query item (chart)
          "content": {
            "query": "requests | where cloud_RoleName == '{ServiceName}' | summarize ErrorRate = countif(success==false)*100.0/count() by bin(timestamp, 5m) | render timechart",
            "visualization": "timechart",
            "title": "Error Rate % — {ServiceName}"
          }
        }
      ]
    }
  }
}

04dDiagnostic Settings

Diagnostic settings control where an Azure resource sends its platform logs and metrics. Every production resource should have diagnostic settings routing to a Log Analytics workspace.

// Bicep: apply diagnostic settings to a resource through a module
module diagnostics 'modules/diagnostics.bicep' = {
  name: 'orders-api-diagnostics'
  params: {
    appServiceName: ordersAppService.name
    workspaceId: logAnalyticsWorkspace.id
  }
}

// modules/diagnostics.bicep
param appServiceName string
param workspaceId string

// scope must be a resource symbol, not a resource ID string,
// so reference the target as an existing resource
resource appService 'Microsoft.Web/sites@2022-09-01' existing = {
  name: appServiceName
}

resource diagnosticSetting 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'send-to-workspace'
  scope: appService
  properties: {
    workspaceId: workspaceId
    // retentionPolicy is deprecated and does not apply to a workspace destination;
    // set retention on the Log Analytics workspace itself
    logs: [
      { categoryGroup: 'allLogs', enabled: true }
    ]
    metrics: [
      { category: 'AllMetrics', enabled: true }
    ]
  }
}

⚠️

Enable Diagnostic Settings at Deployment Time

Diagnostic settings are not enabled by default. Missing settings mean missing logs during an incident. Enforce them with Azure Policy (Deploy-Diagnostics initiative) so every new resource automatically ships logs to your workspace.

05Distributed Tracing

Distributed tracing answers the question: "What happened to this request as it moved across my 12 microservices?" Each service creates a span (a named, timed operation). Spans are linked by a shared trace-id propagated in HTTP headers and message properties. The resulting tree of spans is a trace, which shows the full causal chain, timing, and errors.

Concept	Definition
Trace	The entire journey of one request across all services — a tree of spans
Span	A single named, timed operation within one service (HTTP handler, DB query, cache lookup)
TraceId	128-bit ID unique to one trace — shared by all spans in that trace
SpanId	64-bit ID unique to one span — used as parent reference by child spans
ParentSpanId	The SpanId of the span that created this one — builds the tree structure
Baggage	Key/value pairs propagated through the entire trace — for cross-cutting context
W3C TraceContext	HTTP header standard: traceparent + tracestate — the propagation format
OTLP	OpenTelemetry Protocol — standard wire format for exporting traces, metrics, logs
Sampling	Decision to record or discard a trace — head-based (at trace start) or tail-based

05aOpenTelemetry in .NET 8

OpenTelemetry (OTel) is the vendor-neutral, CNCF-backed standard for instrumentation — it lets you collect traces, metrics, and logs with a single SDK and export to any backend (Azure Monitor, Jaeger, Datadog, Grafana). In .NET 8, OTel is a first-class citizen with native support via System.Diagnostics.Activity for tracing and System.Diagnostics.Metrics for metrics. The setup configures resource metadata (service name, version, environment), adds auto-instrumentation for ASP.NET Core, HttpClient, and EF Core, then exports via OTLP — giving you full observability with minimal code changes.

Full OTel Setup — Tracing + Metrics + Logs

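// NuGet packages needed:
// OpenTelemetry.Extensions.Hosting
// OpenTelemetry.Instrumentation.AspNetCore
// OpenTelemetry.Instrumentation.Http
// OpenTelemetry.Instrumentation.EntityFrameworkCore
// OpenTelemetry.Instrumentation.Runtime
// OpenTelemetry.Instrumentation.Process
// OpenTelemetry.Exporter.OpenTelemetryProtocol
// Azure.Monitor.OpenTelemetry.AspNetCore

using Azure.Monitor.OpenTelemetry.AspNetCore;
using OpenTelemetry;
using OpenTelemetry.Exporter;
using OpenTelemetry.Logs;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);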

builder.Services.AddOpenTelemetry()
    // ── Resource metadata (common tags on all signals) ─────────────
    .ConfigureResource(resource => resource
        .AddService(
            serviceName:    "orders-api",
            serviceVersion: "2.1.0",
            serviceInstanceId: Environment.MachineName)
        .AddAttributes(new Dictionary<string, object>
        {
            ["deployment.environment"] = builder.Environment.EnvironmentName,
            ["service.namespace"]      = "com.myplatform",
            ["k8s.pod.name"]           = Environment.GetEnvironmentVariable("HOSTNAME") ?? "",
        }))

    // ── Tracing ────────────────────────────────────────────────────
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation(options =>
        {
            options.RecordException        = true;
            options.EnrichWithHttpRequest  = (activity, request) =>
            {
                activity.SetTag("http.request.body_size",
                    request.ContentLength ?? 0);
            };
            options.EnrichWithHttpResponse = (activity, response) =>
            {
                activity.SetTag("http.response.body_size",
                    response.ContentLength ?? 0);
            };
            // Don't trace health checks — too noisy
            options.Filter = ctx =>
                !ctx.Request.Path.StartsWithSegments("/health");
        })
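        .AddHttpClientInstrumentation(options =>
        {
            options.RecordException = true;
            // Request headers are not recorded on spans by default, so no redaction
            // is needed. Never mutate the request here: removing "Authorization"
            // would strip the header from the outgoing call itself.
            options.EnrichWithHttpRequestMessage = (activity, request) =>
            {
                activity.SetTag("http.request.has_authorization",
                    request.Headers.Authorization is not null);
            };
        })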
        .AddEntityFrameworkCoreInstrumentation(options =>
        {
            options.SetDbStatementForText        = true;  // Capture SQL
            options.SetDbStatementForStoredProcedure = true;
        })
        .AddSource("Orders.Api")           // Custom ActivitySource
        .AddSource("MassTransit")          // MassTransit instrumentation
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri(
                builder.Configuration["Otel:Endpoint"] ?? "http://localhost:4317");
            options.Protocol = OtlpExportProtocol.Grpc;
        }))

    // ── Metrics ────────────────────────────────────────────────────
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddProcessInstrumentation()
        .AddMeter("Orders.Api")
        .AddOtlpExporter())

    // ── Logging ────────────────────────────────────────────────────
    .WithLogging(logging => logging
        .AddOtlpExporter());

// Also export to Azure Monitor (Application Insights) simultaneously
builder.Services.AddOpenTelemetry().UseAzureMonitor();

Custom ActivitySource — Manual Spans

// Define ActivitySource (one per service or logical domain)
public static class OrdersTracing
{
    public static readonly ActivitySource Source =
        new("Orders.Api", "2.1.0");
}

// Create manual spans for business operations
public async Task<Result<OrderResponse>> Handle(
    CreateOrderCommand cmd, CancellationToken ct)
{
    // Start a new span — child of the current ambient activity
    using var activity = OrdersTracing.Source.StartActivity(
        "CreateOrder",
        ActivityKind.Internal);

    // Tag the span with business context
    activity?.SetTag("order.customer_id", cmd.CustomerId.ToString());
    activity?.SetTag("order.region",      cmd.Region);
    activity?.SetTag("order.item_count",  cmd.LineItems.Count);
    activity?.SetTag("order.channel",     cmd.Channel);

    try
    {
        // Nested span for inventory validation. The using block ends the span as soon
        // as the call returns, so its duration covers only the inventory check.
        bool allAvailable;
        using (var inventorySpan = OrdersTracing.Source.StartActivity(
            "ValidateInventory",
            ActivityKind.Client))
        {
            inventorySpan?.SetTag("inventory.product_count", cmd.LineItems.Count);

            var inventory = await _inventoryClient.ValidateAsync(cmd.LineItems, ct);
            allAvailable = inventory.AllAvailable;
            inventorySpan?.SetTag("inventory.all_available", allAvailable);
        }

        if (!allAvailable)
        {
            activity?.SetStatus(ActivityStatusCode.Error, "Insufficient inventory");
            return Result.Failure<OrderResponse>("Some items are out of stock");
        }

        var order = Order.Create(cmd.CustomerId, cmd.LineItems);
        activity?.SetTag("order.id", order.Id.ToString());

        // Persist before recording the event so the annotation is accurate
        await _repository.SaveAsync(order, ct);

        // Add an event (point-in-time annotation on the span)
        activity?.AddEvent(new ActivityEvent("OrderPersisted", tags:
            new ActivityTagsCollection { ["order.id"] = order.Id.ToString() }));

        activity?.SetStatus(ActivityStatusCode.Ok);
        return Result.Success(OrderResponse.From(order));
    }
    catch (Exception ex)
    {
        // Record exception on span
        activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
        activity?.RecordException(ex);
        throw;
    }
}

05bContext Propagation

For a distributed trace to work, the trace context must travel with the request across every hop — HTTP calls, message queue messages, gRPC calls. The W3C TraceContext standard defines how this works via HTTP headers. OTel handles this automatically for instrumented libraries.

// W3C traceparent header format
traceparent: 00-<trace-id>-<parent-span-id>-<flags>

// Example:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
//           ^  ^                                ^                ^
//           |  128-bit trace ID                 64-bit span ID   flags (01 = sampled)
//           version

// W3C tracestate — vendor-specific data
tracestate: az=<app-insights-specific-data>

// ── HttpClient propagates context automatically with OTel ───────────
// Nothing to do — OTel HttpClient instrumentation injects traceparent

// ── Manual propagation for Service Bus messages ───────────────────────
// Sender — inject current trace context into message properties
public async Task SendOrderEventAsync(OrderCreatedEvent evt)
{
    var message = new ServiceBusMessage(JsonSerializer.SerializeToUtf8Bytes(evt))
    {
        MessageId   = evt.OrderId.ToString(),
        ContentType = "application/json",
    };

    // Inject W3C trace context into message ApplicationProperties
    var propagator = Propagators.DefaultTextMapPropagator;
    propagator.Inject(
        new PropagationContext(Activity.Current?.Context ?? default, Baggage.Current),
        message.ApplicationProperties,
        (props, key, value) => props[key] = value);

    await _sender.SendMessageAsync(message);
}

// Receiver — extract trace context from message and set as parent
public async Task ProcessMessageAsync(ServiceBusReceivedMessage message)
{
    // Extract context from message properties
    var propagator = Propagators.DefaultTextMapPropagator;
    var parentContext = propagator.Extract(
        default,
        message.ApplicationProperties,
        (props, key) =>
        {
            if (props.TryGetValue(key, out var value))
                return [value?.ToString() ?? ""];
            return [];
        });

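    // Restore baggage sent by the producer so downstream code can read it
    Baggage.Current = parentContext.Baggage;

    // Start child span linked to the sender's trace
    using var activity = OrdersTracing.Source.StartActivity(
        "ServiceBus.ProcessOrder",
        ActivityKind.Consumer,
        parentContext.ActivityContext);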

    activity?.SetTag("messaging.system",      "servicebus");
    activity?.SetTag("messaging.destination", "orders");
    activity?.SetTag("messaging.message_id",  message.MessageId);

    await ProcessOrderAsync(message.Body.ToString());
}
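The traceparent layout described in this section can be validated with a few lines of code, which is handy when debugging a hop that silently drops context. This is a minimal Python sketch that handles only version 00 of the header. The function and type names are hypothetical, and the child span ID in the usage example is an arbitrary value chosen for illustration.

```python
import re
from typing import NamedTuple, Optional

# version - 32 hex trace ID - 16 hex span ID - 2 hex flags
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Parse a W3C traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent, new_span_id: str) -> str:
    """Build the header for an outgoing call: same trace ID, new span ID."""
    flags = "01" if parent.sampled else "00"
    return f"{parent.version}-{parent.trace_id}-{new_span_id}-{flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(incoming)
print(child_traceparent(incoming, "b7ad6b7169203331"))
```

The trace ID never changes across hops. Only the span ID is replaced, which is what lets the backend attach each new span to its parent.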

Baggage — Cross-Service Context

// Set baggage at entry point (API Gateway / BFF)
Baggage.Current = Baggage.SetBaggage("tenant.id",    request.TenantId);
Baggage.Current = Baggage.SetBaggage("user.id",      user.GetUserId());
Baggage.Current = Baggage.SetBaggage("feature.flags","beta-checkout=true");

// Read baggage anywhere in the call chain (propagated automatically)
var tenantId   = Baggage.GetBaggage("tenant.id");
var featureFlags = Baggage.GetBaggage("feature.flags");

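// Baggage travels with the trace context to every downstream service, but it is
// not copied onto spans automatically: add the entries you need as span tags.
// It is sent in plain-text headers, so never put secrets or personal data in it.
// Useful for: multi-tenant filtering, feature flag tracking,
//             A/B test attribution, compliance tagging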
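On the wire, baggage is carried in a single `baggage` header as comma-separated `key=value` pairs with percent-encoded values. The sketch below shows that encoding in plain Python so the overhead per request is easy to picture. It is a simplified illustration (optional per-entry properties after `;` are dropped on decode), and the helper names are hypothetical.

```python
from urllib.parse import quote, unquote


def encode_baggage(entries: dict[str, str]) -> str:
    """Serialise baggage entries into the value of a W3C `baggage` header."""
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in entries.items())


def decode_baggage(header: str) -> dict[str, str]:
    """Parse a `baggage` header value back into a dictionary."""
    entries: dict[str, str] = {}
    for member in header.split(","):
        if "=" not in member:
            continue  # skip empty or malformed members
        key, _, value = member.partition("=")
        value = value.split(";", 1)[0]  # ignore optional properties
        entries[key.strip()] = unquote(value.strip())
    return entries


header = encode_baggage({"tenant.id": "contoso", "feature.flags": "beta-checkout=true"})
print(header)
print(decode_baggage(header))
```

Every entry is repeated on every outgoing call, so keeping baggage to a handful of short values limits header size.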

05cTrace Sampling Strategies

At scale, recording every single trace is prohibitively expensive — a service handling 10,000 requests per second generates terabytes of trace data daily. Sampling strategies let you keep costs manageable while preserving visibility into errors and slow requests. Head-based sampling decides at trace start whether to record (simple but may miss rare errors), while tail-based sampling buffers complete traces and keeps only interesting ones (errors, slow requests) — more powerful but requires additional infrastructure. The best production strategy combines ratio-based sampling for normal traffic with always-on sampling for critical paths like payments and checkout.
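A quick back-of-the-envelope calculation shows where the terabytes come from. The Python sketch below uses the 10,000 requests per second figure from the paragraph above; the 8 spans per request and 500 bytes per span are assumptions chosen purely for illustration, so substitute measured values from your own services.

```python
def daily_trace_volume_gb(
    requests_per_second: float,
    spans_per_request: float,
    bytes_per_span: float,
    sample_ratio: float = 1.0,
) -> float:
    """Estimate the trace data exported per day, in gigabytes (10**9 bytes)."""
    seconds_per_day = 24 * 60 * 60
    total_bytes = (
        requests_per_second * seconds_per_day
        * spans_per_request * bytes_per_span * sample_ratio
    )
    return total_bytes / 1e9


# Assumed workload: 8 spans per request, 500 bytes per exported span
full = daily_trace_volume_gb(10_000, spans_per_request=8, bytes_per_span=500)
sampled = daily_trace_volume_gb(10_000, spans_per_request=8, bytes_per_span=500, sample_ratio=0.10)

print(f"100% sampling: {full:,.0f} GB/day")
print(f" 10% sampling: {sampled:,.1f} GB/day")
```

Under these assumptions, full sampling produces roughly 3.5 TB per day, and a 10% ratio cuts that by an order of magnitude, which is why the ratio is usually the first cost lever to adjust.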

Strategy	Decision Point	Pros	Cons
AlwaysOn	Head (trace start)	See everything	Huge data volume and cost at scale
AlwaysOff	Head	Zero cost	No visibility at all — dev only
TraceIdRatio (N%)	Head	Predictable volume, simple config	May miss infrequent errors
ParentBased	Head	Respects upstream sampling decision	Downstream services can't record a trace the upstream dropped
Tail-Based (OTel Collector)	Tail (after trace)	Can always keep error and slow traces	Requires full trace buffering, more infra
Adaptive (App Insights)	Head (dynamic)	Auto-adjusts to traffic volume	Proprietary — App Insights only
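Ratio-based head sampling works across services because the decision is derived from the trace ID rather than from a random number, so every service reaches the same verdict for the same trace. The following Python sketch illustrates the idea. Using the low 64 bits of the trace ID is an assumption made for this example, since SDKs differ in which bits they inspect.

```python
from typing import Optional


def ratio_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep the trace if the low 64 bits of its ID fall below ratio * 2**64."""
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < int(ratio * (1 << 64))


def parent_based(parent_sampled: Optional[bool], trace_id_hex: str, ratio: float) -> bool:
    """Follow the upstream decision when there is one; otherwise apply the ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sample(trace_id_hex, ratio)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(ratio_sample(trace_id, 0.10))        # same answer in every service
print(ratio_sample(trace_id, 0.75))
print(parent_based(True, trace_id, 0.10))  # upstream decision wins
```

Combining the two, as a parent-based sampler wrapping a ratio sampler does, avoids broken traces where some services record their spans and others do not.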

// ── Head-based sampling via OTel SDK ─────────────────────────────────

// 1. Sample 10% of new traces and follow the upstream decision otherwise.
//    A head sampler runs before the request executes, so it cannot know whether
//    a span will end in an error. Keeping every error trace requires tail-based
//    sampling, for example in an OpenTelemetry Collector.
builder.Services.AddOpenTelemetry()
    .WithTracing(t => t
        .SetSampler(new ParentBasedSampler(
            new TraceIdRatioBasedSampler(0.10)))
    );

// Custom head sampler: always sample critical paths, ratio-sample everything else.
// Latency is unknown at trace start, so "keep slow requests" needs tail-based sampling.
public class CriticalPathSampler : Sampler
{
    private readonly Sampler _fallback;

    public CriticalPathSampler(double fallbackRatio) =>
        _fallback = new TraceIdRatioBasedSampler(fallbackRatio);

    public override SamplingResult ShouldSample(in SamplingParameters parameters)
    {
        // Always sample if parent says so (preserve parent decision)
        if (parameters.ParentContext.TraceFlags.HasFlag(ActivityTraceFlags.Recorded))
            return new SamplingResult(SamplingDecision.RecordAndSample);

        // Sample based on URL priority. Tags are available only if the instrumentation
        // supplies them at span creation; newer semantic conventions use "url.path".
        var httpPath = parameters.Tags?
            .FirstOrDefault(t => t.Key is "http.target" or "url.path").Value?.ToString();

        // Always trace payment and checkout flows
        if (httpPath?.Contains("/checkout") == true ||
            httpPath?.Contains("/payments") == true)
            return new SamplingResult(SamplingDecision.RecordAndSample);

        // Deterministic ratio for everything else, keyed on the trace ID so that
        // every service makes the same decision (for example 0.05 for 5%)
        return _fallback.ShouldSample(parameters);
    }
}

05dJaeger & Trace Backends

Your trace data needs a backend to store, index, and visualize it — and you have several options depending on your environment and budget. Azure Monitor (Application Insights) is the native choice for Azure workloads with zero infrastructure to manage, while Jaeger and Grafana Tempo are excellent open-source alternatives for local development, staging environments, or multi-cloud setups. The beauty of OpenTelemetry is that you can switch backends by changing exporter configuration alone — your instrumentation code stays the same. Use Jaeger locally via Docker for instant trace visualization during development, then export to Azure Monitor in production.

Backend	Type	Best For	Azure Integration
Azure Monitor (App Insights)	Managed SaaS	Full Azure-native stack, E2E view	Native — UseAzureMonitor()
Jaeger	Open source	Dev/staging, self-hosted tracing	Deploy on AKS, OTLP ingest
Grafana Tempo	Open source	Cost-effective, Grafana-native	Deploy on AKS, query in Grafana
Zipkin	Open source	Legacy, simple setup	OTLP export via OTel collector
Datadog APM	Commercial SaaS	Cross-cloud, ML anomaly detection	OTLP export to Datadog agent
Honeycomb	Commercial SaaS	High-cardinality event analytics	OTLP export

Local Dev with Jaeger via Docker

# docker-compose.override.yml — dev environment
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC receiver
      - "4318:4318"     # OTLP HTTP receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    networks: [platform]

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4319:4317"     # Receive from apps
    networks: [platform]

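# otel-collector-config.yaml
# The dedicated "jaeger" exporter was removed from recent collector-contrib
# releases; Jaeger accepts OTLP directly, so export to its OTLP port instead.
# receivers:
#   otlp:
#     protocols: { grpc: { endpoint: 0.0.0.0:4317 } }
# exporters:
#   otlp/jaeger:
#     endpoint: jaeger:4317
#     tls: { insecure: true }
#   azuremonitor:
#     connection_string: "${env:AI_CONNECTION_STRING}"
# service:
#   pipelines:
#     traces:
#       receivers: [otlp]
#       exporters: [otlp/jaeger, azuremonitor]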

// appsettings.Development.json — point to local Jaeger
{
  "Otel": {
    "Endpoint": "http://localhost:4317"
  },
  "ApplicationInsights": {
    "ConnectionString": ""
  }
}

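// appsettings.Production.json: OTLP to a collector, Azure Monitor via connection string
// The Application Insights ingestion endpoint does not accept OTLP, so point the OTLP
// exporter at a collector (the hostname below is a placeholder). Azure Monitor receives
// data through UseAzureMonitor() and the connection string.
{
  "Otel": {
    "Endpoint": "http://otel-collector:4317"
  },
  "ApplicationInsights": {
    // Key Vault references are resolved in App Service / Functions app settings,
    // not in appsettings.json. Supply this value as an app setting:
    // @Microsoft.KeyVault(VaultName=myVault;SecretName=ai-conn-string)
    "ConnectionString": ""
  }
}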

06SLIs, SLOs & Error Budgets

SLOs (Service Level Objectives) are the bridge between engineering and business. They define what "good enough" looks like for your service in terms of measurable user experience.

Term	Definition	Example
SLI (Indicator)	The metric you measure	P99 latency of /api/orders POST
SLO (Objective)	The target level for the SLI	99% of POST /api/orders requests complete in under 500 ms over 30 days (P99 < 500 ms)
SLA (Agreement)	Contract with users/customers — legal commitment	99.9% uptime per month
Error Budget	Allowed unreliability (100% minus the SLO) that you can spend on incidents and deployments	0.5% of requests allowed to fail in 30 days (for a 99.5% SLO)
Burn Rate	How fast you're consuming error budget	2x burn rate = budget gone in 15 days not 30
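The terms in this table are linked by simple arithmetic, sketched below in Python. The 99.5% target and 30-day window come from the examples above; the request volume of 10 million and the observed 1% error rate are made-up inputs for illustration.

```python
def error_budget_fraction(slo_percent: float) -> float:
    """Fraction of requests (or time) allowed to fail, e.g. 0.005 for 99.5%."""
    return (100.0 - slo_percent) / 100.0


def allowed_failures(total_requests: int, slo_percent: float) -> int:
    """Number of failed requests the budget permits in the window."""
    return round(total_requests * error_budget_fraction(slo_percent))


def allowed_downtime_minutes(window_days: int, slo_percent: float) -> float:
    """The same budget expressed as minutes of full outage."""
    return window_days * 24 * 60 * error_budget_fraction(slo_percent)


def burn_rate(observed_error_percent: float, slo_percent: float) -> float:
    """How many times faster than sustainable the budget is being consumed."""
    return observed_error_percent / (100.0 - slo_percent)


def days_until_exhausted(window_days: int, rate: float) -> float:
    """Days until the budget is gone if the current burn rate continues."""
    return window_days / rate


print(allowed_failures(10_000_000, 99.5))   # failed requests permitted
print(allowed_downtime_minutes(30, 99.5))   # minutes of total outage permitted
rate = burn_rate(observed_error_percent=1.0, slo_percent=99.5)
print(rate, days_until_exhausted(30, rate)) # burn rate and days remaining
```

A 1% error rate against a 0.5% budget is a burn rate of 2, which matches the table: the 30-day budget is exhausted in 15 days.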

SLO Queries in KQL

// ── Availability SLO — % of successful requests ──────────────────────
let sloWindow = 30d;
let sloTarget = 99.5; // 99.5% availability
requests
| where timestamp > ago(sloWindow)
| where name !startswith "GET /health"   // Exclude health checks from SLO
| summarize
    Total   = count(),
    Success = countif(success == true)
| extend
    AvailabilitySLI = round(100.0 * Success / Total, 3),
    ErrorBudget_pct = 100.0 - sloTarget,
    ActualErrors    = Total - Success,
    BudgetedErrors  = round(Total * (1 - sloTarget / 100.0), 0),
    BudgetRemaining = round(Total * (1 - sloTarget / 100.0), 0) - (Total - Success)
| extend SLO_Met = AvailabilitySLI >= sloTarget

// ── Latency SLO — P99 must be under 500ms ─────────────────────────────
requests
| where timestamp > ago(30d)
| where name == "POST /api/v1/orders"
| summarize
    TotalRequests     = count(),
    WithinTarget      = countif(duration < 500),
    P50_ms = percentile(duration, 50),
    P95_ms = percentile(duration, 95),
    P99_ms = percentile(duration, 99)
| extend
    SLI_latency = round(100.0 * WithinTarget / TotalRequests, 3),
    SLO_target  = 99.0,
    SLO_Met     = (100.0 * WithinTarget / TotalRequests) >= 99.0

// ── Error budget burn rate alert ──────────────────────────────────────
// Alert when burning through budget more than 2x faster than sustainable
let budget_pct   = 0.5;   // 0.5% of requests can fail
let burn_window  = 1h;
let total_window = 30d;
let totalRequests = toscalar(requests | where timestamp > ago(total_window) | count);
let budgetTotal   = totalRequests * (budget_pct / 100.0);
let recentErrors  = toscalar(
    requests | where timestamp > ago(burn_window) | where success == false | count);
let sustainableBurnPerHour = budgetTotal / (30.0 * 24.0);
// A bare boolean expression is not a valid query statement.
// Return a row only when the budget is burning more than 2x too fast.
print BurnRate = todouble(recentErrors) / sustainableBurnPerHour
| where BurnRate > 2

07Alerting Patterns & Runbooks

Good alerting is the difference between catching incidents in minutes versus hours — but bad alerting creates noise that trains your team to ignore pages. The golden rule is to alert on user-visible symptoms (error rate, latency, availability) rather than internal causes (CPU, memory), and every alert must have a clear runbook explaining what to do when it fires. Combine multi-window burn rate alerts for SLO-based detection with deadman's switches for silence detection, and tier severity levels so Sev1 pages on-call immediately while Sev3 creates a ticket for next business day.

🚨

Symptom-Based Alerting

Alert on user-visible symptoms (error rate, latency, availability) not internal causes (CPU, memory). Users don't care about CPU — they care about failed requests.

📊

Multi-Window Burn Rate

Alert on error budget burn rate at two windows (1h fast-burn AND 6h slow-burn). Catches both sudden spikes and gradual degradation before the budget is exhausted.

🔇

Alert Fatigue Prevention

Every alert must be actionable with a clear runbook. Silence noisy alerts or increase thresholds. An alert that fires every day and nobody acts on is worse than no alert.

📝

Runbook Links

Every alert includes a link to its runbook in the description — what the alert means, initial diagnosis steps, escalation path, and mitigation actions.

🌡️

Deadman's Switch

Alert when a metric stops being reported — e.g. if no requests arrive in 5 min, something is broken. Silence is often worse than an error.

🎯

Severity Tiering

Sev1: page on-call immediately (user impact now). Sev2: notify team (degradation). Sev3: ticket (investigate next business day). Match severity to actual impact.

Runbook Template

## Alert: High Error Rate — Orders API

**Severity:** Sev1
**SLO Impact:** Yes — consuming error budget at >2x sustainable rate

### What this means
More than 5% of POST /api/v1/orders requests are failing over the last 5 minutes.
Users cannot place orders.

### Immediate Diagnosis (< 5 minutes)
1. Check Application Map in App Insights — which dependency is failing?
   Link: https://portal.azure.com/#resource/<ai-resource-id>/applicationMap

2. Run this query in Log Analytics:
   ```kql
   exceptions
   | where timestamp > ago(30m)
   | where cloud_RoleName == "orders-api"
   | summarize count() by outerMessage
   | order by count_ desc
   ```

3. Check recent deployments:
   Link: https://dev.azure.com/<org>/<project>/_release

### Common Causes & Fixes
| Symptom | Cause | Fix |
|---------|-------|-----|
| SQL timeout errors | DB CPU spike | Scale up SQL, check missing indexes |
| 503 from inventory-api | Inventory service down | Check inventory pod status in AKS |
| 401 from payment-api | Token expiry / managed identity (MI) issue | Restart pod, check MI role assignments |
| OutOfMemory | Memory leak in new deploy | Roll back deployment |

### Escalation
- Engineering on-call: PagerDuty → orders-api rotation
- Incident commander: Slack #incidents

08Comparison & Decision Tables

The services in the Azure monitoring stack overlap, so knowing which one to reach for saves time during incidents and planning. Application Insights is the APM layer for request-level performance, Log Analytics is the query engine for cross-service log correlation, Azure Monitor Metrics provides near-real-time numeric dashboards, and distributed tracing shows the full request path across microservices. Use the decision tables below as a quick reference when you need to answer a specific observability question.

When to Use Each Service

Question	Answer / Service
I want to see all HTTP requests and response times for my API	Application Insights — Requests table
I want to see the SQL queries my app is running and how slow they are	Application Insights — Dependencies table
I want to search across all service logs including platform events	Log Analytics — KQL across all tables
I want a chart of CPU and memory over time	Azure Monitor Metrics — metric explorer
I want to be paged when error rate exceeds 5%	Azure Monitor Alert Rule (log search or metric)
I want to see the full path of one slow request across 5 microservices	Distributed Tracing — App Insights E2E or Jaeger
I want to track a custom business event like 'OrderPlaced'	Application Insights — TrackEvent / customEvents
I want to query Key Vault access audit logs	Log Analytics — AzureDiagnostics | where ResourceType == "VAULTS"
I want a dashboard showing SLOs for 5 services	Azure Monitor Workbook or Grafana
I want to detect anomalous exception spikes automatically	Application Insights Smart Detection
I want Prometheus metrics for my K8s workloads	Azure Monitor Metrics + Container Insights + Prometheus scraping

Logging Levels — What to Log at Each Level

Level	ILogger Method	Use For	Production Volume
Critical	LogCritical	App cannot continue — unrecoverable failure	Rare — immediate page
Error	LogError	Operation failed — exception, 5xx, data corruption	Low — alert on any
Warning	LogWarning	Unexpected but handled — retry, DLQ, validation fail	Medium — trend on
Information	LogInformation	Normal business flow — request received, order created	High — sample in prod
Debug	LogDebug	Diagnostic detail — cache hit/miss, intermediate values	None in prod
Trace	LogTrace	Very verbose — method entry/exit, every loop iteration	Never in prod

09Quick Reference Cheat Sheet

This cheat sheet collects the most commonly used KQL patterns and OpenTelemetry setup snippets as copy-paste-ready blocks. Keep them handy during incident response, when you need to query error rates or latency percentiles, or trace a specific operation across services. The NuGet package table at the end lists the packages to install for each instrumentation scenario, from basic ASP.NET Core auto-instrumentation to full OTel with Prometheus metrics export.

Essential KQL Patterns

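// Always start with a time filter
// (timestamp in App Insights tables such as requests; TimeGenerated in Log Analytics workspace tables)
| where timestamp > ago(1h)

// Error rate (%) per 5-minute bin
| summarize ErrorRate = countif(success == false) * 100.0 / count() by bin(timestamp, 5m)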

// Latency percentiles
| summarize P50=percentile(duration,50), P95=percentile(duration,95), P99=percentile(duration,99) by name

// Top N
| top 20 by Count desc

// Count distinct
| summarize UniqueUsers = dcount(user_Id)

// Dynamic property
| extend Region = tostring(customDimensions["Region"])

// Time render
| render timechart

// Cross AI resource query
app("MyOtherAppInsights").requests | where timestamp > ago(1h)

// Parse structured log message (the column is message in traces, Message in AppTraces)
| parse message with "OrderId=" OrderId " status=" Status

OTel .NET 8 Minimal Setup

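// Minimal OTel setup for a microservice
// Namespaces used below: System.Diagnostics, System.Diagnostics.Metrics
builder.Services.AddOpenTelemetry()
    // AddService's second positional parameter is serviceNamespace, so pass the version by name
    .ConfigureResource(r => r.AddService("my-service", serviceVersion: "1.0.0"))
    .WithTracing(t => t
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddSource("MyService")
        .AddOtlpExporter())
    .WithMetrics(m => m
        .AddAspNetCoreInstrumentation()
        .AddMeter("MyService")
        .AddOtlpExporter())
    .UseAzureMonitor(); // Also send to App Insights (reads APPLICATIONINSIGHTS_CONNECTION_STRING)

// Declare once, as fields of the class that uses them; names must match AddSource / AddMeter above
static readonly ActivitySource MySource = new("MyService");
static readonly Meter MyMeter = new("MyService");
static readonly Counter<long> MyCounter =
    MyMeter.CreateCounter<long>("orders.processed"); // hypothetical metric name

// Custom span
using var activity = MySource.StartActivity("DoSomething");
activity?.SetTag("key", "value");
activity?.AddEvent(new ActivityEvent("StepCompleted"));

// Custom metric
MyCounter.Add(1, new KeyValuePair<string,object?>("region","EU"));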
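
The `?.` calls in the custom span snippet matter because `StartActivity` returns null when nothing is listening to the source. The sketch below demonstrates this with only the .NET base class library; the hand-written `ActivityListener` is an assumption that stands in for the listener the OTel SDK would register through `AddSource("MyService")`.

```csharp
using System;
using System.Diagnostics;

var source = new ActivitySource("MyService");

// No listener is registered yet, so no Activity is created.
var before = source.StartActivity("DoSomething");
string beforeState = before is null ? "null" : "created";
Console.WriteLine("Without a listener: " + beforeState);

// Stand-in for the OTel SDK: listen to this source and record every activity.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "MyService",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
};
ActivitySource.AddActivityListener(listener);

// With a listener in place the same call returns a real Activity.
using var after = source.StartActivity("DoSomething");
after?.SetTag("key", "value");
string afterState = after is null ? "null" : "created";
Console.WriteLine("With a listener: " + afterState);
```

If the name passed to `AddSource` does not match the `ActivitySource` name, the spans are dropped silently in the same way, which is worth checking first when custom spans are missing.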

NuGet Package	Purpose
Microsoft.ApplicationInsights.AspNetCore	Classic App Insights SDK for ASP.NET Core
Azure.Monitor.OpenTelemetry.AspNetCore	Modern OTel-based Azure Monitor exporter (recommended)
OpenTelemetry.Extensions.Hosting	OTel host builder extensions
OpenTelemetry.Instrumentation.AspNetCore	Auto-instrument HTTP requests
OpenTelemetry.Instrumentation.Http	Auto-instrument HttpClient calls
OpenTelemetry.Instrumentation.EntityFrameworkCore	Auto-instrument EF Core SQL queries
OpenTelemetry.Instrumentation.Runtime	GC, thread pool, heap metrics
OpenTelemetry.Exporter.OpenTelemetryProtocol	OTLP exporter (Jaeger, Tempo, Collector)
OpenTelemetry.Exporter.Prometheus.AspNetCore	/metrics endpoint for Prometheus scraping
Serilog.AspNetCore + Serilog.Sinks.ApplicationInsights	Structured logging to App Insights via Serilog
Microsoft.ApplicationInsights.WorkerService	App Insights for background services and workers
