📊 Azure Observability

Monitoring & Observability
Complete Guide

From beginner to production: Application Insights telemetry and sampling, Log Analytics KQL queries, Azure Monitor metrics and alert rules, OpenTelemetry distributed tracing in .NET 8, context propagation, SLOs, error budgets, and the observability patterns you need to operate cloud systems with confidence.

Beginner → Architecture · 26 Sections · OpenTelemetry · KQL Deep Dive · .NET 8 Examples

01 Monitoring & Observability Overview

Modern cloud systems are too complex to debug with guesswork. Observability is the ability to understand a system's internal state from its external outputs, and its three pillars are logs, metrics, and traces. Azure's monitoring stack maps these pillars to specific services that work together:

🔭
Application Insights
APM (Application Performance Management): request tracking, dependency calls, exceptions, custom events, user analytics, availability tests.
📋
Log Analytics
Centralised log store and query engine. KQL-powered queries across logs from every Azure service, custom apps, VMs, and containers.
📈
Azure Monitor
The platform umbrella: collects metrics from every Azure resource, evaluates alert rules, triggers action groups, and powers workbooks.
🕸️
Distributed Tracing
End-to-end request tracking across microservices. OpenTelemetry propagates W3C trace context, so you can see the full call chain from gateway to database.
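
To make W3C trace context propagation concrete, here is a minimal sketch that parses a `traceparent` header using only the .NET base class library. The class name `TraceParentDemo` and the method `Describe` are hypothetical names for illustration; the header layout (`version-traceid-spanid-flags`) comes from the W3C Trace Context specification.

```csharp
using System;
using System.Diagnostics;

public static class TraceParentDemo
{
    // Parses a W3C traceparent header: version-traceid-spanid-flags.
    public static string Describe(string traceParent)
    {
        // ActivityContext.TryParse validates the header format for us.
        if (!ActivityContext.TryParse(traceParent, null, out ActivityContext ctx))
            return "invalid traceparent";

        // The "recorded" flag tells downstream services the caller sampled this trace.
        bool sampled = (ctx.TraceFlags & ActivityTraceFlags.Recorded) != 0;
        return $"trace={ctx.TraceId} parentSpan={ctx.SpanId} sampled={sampled}";
    }
}
```

Every service that forwards this header unchanged in its trace ID keeps the whole call chain under one trace, which is what lets a backend stitch the spans together.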

The Three Pillars of Observability

| Pillar | What It Answers | Azure Service | Data Type |
|---|---|---|---|
| Logs | What happened? What were the inputs/outputs? | Log Analytics / App Insights | Structured JSON events |
| Metrics | How is the system performing right now? | Azure Monitor Metrics | Numeric time-series |
| Traces | Why did this request take so long? What path did it take? | App Insights / Jaeger / Tempo | Distributed spans |

How the Stack Fits Together

text
Your Application (.NET 8 / Functions / AKS pods)
        │
        │  emits via OpenTelemetry SDK or App Insights SDK
        ▼
┌────────────────────────────────────────────────────────┐
│              Azure Monitor (the platform)              │
│                                                        │
│  ┌─────────────────────┐   ┌────────────────────────┐  │
│  │ Application Insights│   │   Log Analytics        │  │
│  │  (APM + traces +    │   │   Workspace            │  │
│  │   exceptions +      │   │   (logs from ALL       │  │
│  │   custom events)    │   │    Azure resources)    │  │
│  └─────────────────────┘   └────────────────────────┘  │
│                                                        │
│  ┌─────────────────────┐   ┌────────────────────────┐  │
│  │  Metrics Store      │   │   Alerts Engine        │  │
│  │  (time-series,      │   │   (rules, action       │  │
│  │   auto-collected    │   │    groups, PagerDuty)  │  │
│  │   from all services)│   │                        │  │
│  └─────────────────────┘   └────────────────────────┘  │
└────────────────────────────────────────────────────────┘
        │
        ▼
Workbooks · Dashboards · Grafana · Power BI

02 Application Insights

Application Insights is Azure's Application Performance Management (APM) service. It automatically collects request telemetry, dependency calls (SQL, HTTP, Service Bus), exceptions, and performance counters, and it accepts custom events that you emit yourself, giving you end-to-end visibility into every user interaction and every background job.

| Telemetry Type | Auto-Collected? | Description |
|---|---|---|
| Requests | ✓ Yes | Every inbound HTTP request: duration, status, URL, method |
| Dependencies | ✓ Yes | Outbound HTTP calls, SQL queries, Service Bus, Redis, Storage |
| Exceptions | ✓ Yes | Unhandled exceptions with stack traces and request context |
| Traces | ✓ Yes | ILogger output: severity, message, properties |
| Performance Counters | ✓ Yes | CPU, memory, GC, thread count, read from the host OS |
| Custom Events | Manual | Business events: OrderPlaced, PaymentFailed, FeatureUsed |
| Custom Metrics | Manual | Numeric measurements: queue depth, cache hit rate |
| Page Views | JS SDK | Browser page load times, user sessions, demographics |
| Availability | Configured | Synthetic ping / multi-step tests from global locations |

02a SDK Setup & Auto-Instrumentation

Getting Application Insights into a .NET 8 service takes a few lines of configuration; the SDK then auto-instruments HTTP requests, dependency calls, and exceptions. The modern approach uses the OpenTelemetry-based Azure Monitor exporter, which gives you vendor-neutral instrumentation with Azure-native export. Always use connection strings (not instrumentation keys) and store them in Key Vault; never hardcode secrets in application code. With the classic SDK you can also enrich every telemetry item with custom properties using ITelemetryInitializer, which is invaluable for filtering by service version, environment, or correlation ID during incident investigation.

Connection-String Based Setup (.NET 8)

csharp
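// NuGet (Option A): Azure.Monitor.OpenTelemetry.AspNetCore   <- modern, OpenTelemetry-based
// NuGet (Option B): Microsoft.ApplicationInsights.AspNetCore <- classic SDK
// Pick ONE option per service; registering both can produce duplicate telemetry.

// ── Option A: Azure Monitor OpenTelemetry (recommended for new projects) ──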
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    .UseAzureMonitor(options =>
    {
        // Connection string from Key Vault / App Settings; never hardcode
        options.ConnectionString =
            builder.Configuration["ApplicationInsights:ConnectionString"];

        // Sampling β€” see Section 02c
        options.SamplingRatio = 0.1f; // 10% sampling in production
    });

// ── Option B: Classic App Insights SDK ────────────────────────────────
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString =
        builder.Configuration["ApplicationInsights:ConnectionString"];
    options.EnableAdaptiveSampling = true;
    options.EnableHeartbeat        = true;
    options.EnableDebugLogger      = false; // Off in production
});

// ── appsettings.json ──────────────────────────────────────────────────
// {
//   "ApplicationInsights": {
//     "ConnectionString": "@Microsoft.KeyVault(VaultName=myVault;SecretName=ai-conn-string)"
//   }
// }
//
// Connection string format:
// InstrumentationKey=<key>;IngestionEndpoint=https://eastus-8.in.applicationinsights.azure.com/;...

Worker Service / Background Jobs

csharp
// NuGet: Microsoft.ApplicationInsights.WorkerService
builder.Services.AddApplicationInsightsTelemetryWorkerService(options =>
{
    options.ConnectionString =
        builder.Configuration["ApplicationInsights:ConnectionString"];
});

// Azure Functions: auto-configured via host.json
// host.json:
// {
//   "logging": {
//     "applicationInsights": {
//       "samplingSettings": { "isEnabled": true, "maxTelemetryItemsPerSecond": 20 }
//     }
//   }
// }

Enriching All Telemetry with Custom Properties

csharp
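using Microsoft.ApplicationInsights.Channel;        // ITelemetry
using Microsoft.ApplicationInsights.DataContracts;  // ISupportProperties
using Microsoft.ApplicationInsights.Extensibility;  // ITelemetryInitializer

// ITelemetryInitializer (classic SDK) runs on every telemetry item before it's sent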
public class ServiceTelemetryInitializer : ITelemetryInitializer
{
    private readonly IHttpContextAccessor _http;

    public ServiceTelemetryInitializer(IHttpContextAccessor http) => _http = http;

    public void Initialize(ITelemetry telemetry)
    {
        // Tag every item with service metadata
        telemetry.Context.Cloud.RoleName     = "orders-api";
        telemetry.Context.Cloud.RoleInstance = Environment.MachineName;

        // Add custom global dimensions
        if (telemetry is ISupportProperties props)
        {
            props.Properties["ServiceVersion"] = "1.4.2";
            props.Properties["Environment"]    = Environment.GetEnvironmentVariable("ASPNETCORE_ENVIRONMENT") ?? "Unknown";

            // Propagate correlation IDs from inbound request headers
            var ctx = _http.HttpContext;
            if (ctx is not null)
            {
                var correlationId = ctx.Request.Headers["X-Correlation-Id"].FirstOrDefault();
                if (!string.IsNullOrEmpty(correlationId))
                    props.Properties["CorrelationId"] = correlationId;

                var userId = ctx.User.FindFirst("oid")?.Value;
                if (!string.IsNullOrEmpty(userId))
                    props.Properties["UserId"] = userId;
            }
        }
    }
}

// Register in DI
services.AddSingleton<ITelemetryInitializer, ServiceTelemetryInitializer>();

02b Custom Telemetry

While auto-instrumentation captures HTTP requests and dependencies, custom telemetry lets you track what matters to your business: events like "OrderPlaced" or "PaymentFailed", custom metrics like queue depth or cache hit rate, and manual dependency tracking for calls the SDK does not collect automatically. This lets you correlate technical performance with business outcomes. Use TelemetryClient for custom events and metrics, ILogger with structured properties for contextual traces, and operation holders to group related telemetry items under a single operation ID for end-to-end correlation.

Tracking Custom Events, Metrics & Dependencies

csharp
using System.Diagnostics;            // Stopwatch
using Microsoft.ApplicationInsights; // TelemetryClient

// IInventoryClient is a hypothetical wrapper around the inventory gRPC client,
// injected alongside the other dependencies.
public class OrderService(
    TelemetryClient telemetry,
    IInventoryClient inventoryClient,
    ILogger<OrderService> logger)
{
    public async Task<Order> CreateOrderAsync(CreateOrderCommand cmd, CancellationToken ct)
    {
        // ── Custom Event: business milestone ──────────────────────────
        telemetry.TrackEvent("OrderCreated", new Dictionary<string, string>
        {
            ["CustomerId"] = cmd.CustomerId.ToString(),
            ["Region"]     = cmd.Region,
            ["ItemCount"]  = cmd.LineItems.Count.ToString(),
        },
        new Dictionary<string, double>
        {
            ["OrderAmount"] = (double)cmd.TotalAmount,
        });

        // ── Custom Metric: numeric measurement ────────────────────────
        telemetry.TrackMetric("Order.TotalAmount", (double)cmd.TotalAmount,
            new Dictionary<string, string> { ["Region"] = cmd.Region });

        // ── Custom Dependency: any external call ──────────────────────
        var startTime = DateTimeOffset.UtcNow;
        var timer     = Stopwatch.StartNew();
        bool success  = false;
        try
        {
            var result = await inventoryClient.ReserveItemsAsync(cmd.LineItems, ct);
            success = result.IsSuccess;
            return result.Value;
        }
        catch (Exception ex)
        {
            // ── Exception with extra context ──────────────────────────
            telemetry.TrackException(ex, new Dictionary<string, string>
            {
                ["OrderId"]    = cmd.OrderId.ToString(),
                ["CustomerId"] = cmd.CustomerId.ToString(),
                ["Operation"]  = "CreateOrder",
            });
            throw;
        }
        finally
        {
            timer.Stop();
            telemetry.TrackDependency(
                dependencyTypeName: "gRPC",
                target:   "inventory-api",
                dependencyName: "InventoryClient.ReserveItems",
                data:     $"CustomerId={cmd.CustomerId}",
                startTime: startTime,
                duration:  timer.Elapsed,
                resultCode: success ? "200" : "500",
                success:   success);
        }
    }
}

// ── Structured logging β€” ILogger feeds into App Insights traces ───────
logger.LogInformation(
    "Order {OrderId} created for customer {CustomerId} β€” amount {Amount:C}",
    order.Id, order.CustomerId, order.TotalAmount);

// ── Using scopes for correlated log groups ────────────────────────────
using (logger.BeginScope(new Dictionary<string, object>
{
    ["OrderId"]    = order.Id,
    ["CustomerId"] = order.CustomerId,
    ["RequestId"]  = Activity.Current?.TraceId.ToString() ?? ""
}))
{
    logger.LogInformation("Starting payment processing");
    await ProcessPaymentAsync(order, ct);
    logger.LogInformation("Payment processed successfully");
}
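
With the OpenTelemetry-based setup (Option A in Section 02a), custom metrics are typically recorded through `System.Diagnostics.Metrics` rather than `TelemetryClient`. The following is a minimal sketch under stated assumptions: the meter name `Orders.Api`, the counter name `orders.created`, and the class `OrderMetricsDemo` are hypothetical, and the `MeterListener` exists only so the sketch runs without an exporter. In a real service the meter would also need to be registered with the OpenTelemetry metrics pipeline (for example through `AddMeter`), which is not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class OrderMetricsDemo
{
    // Hypothetical meter and instrument names, chosen for illustration.
    private static readonly Meter OrdersMeter = new("Orders.Api", "1.0.0");
    private static readonly Counter<long> OrdersCreated =
        OrdersMeter.CreateCounter<long>("orders.created");

    // Records one order, tagged with its region as a metric dimension.
    public static void RecordOrder(string region) =>
        OrdersCreated.Add(1, new KeyValuePair<string, object?>("region", region));

    // Collects measurements in-process so the sketch runs without an exporter.
    public static long CountRecorded(Action work)
    {
        long total = 0;
        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == "Orders.Api")
                l.EnableMeasurementEvents(instrument);
        };
        listener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) => total += value);
        listener.Start();

        work();
        return total;
    }
}
```

Calling `OrderMetricsDemo.CountRecorded(() => OrderMetricsDemo.RecordOrder("EU"))` exercises the counter end to end; an exporter would receive the same measurements, aggregated by the `region` tag.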

Operation Tracking: Grouping Related Telemetry

csharp
// Use IOperationHolder to group related telemetry items
// All items inside the using block share the same operation ID
public async Task ProcessMessageAsync(ServiceBusReceivedMessage message)
{
    using var operation = telemetry.StartOperation<RequestTelemetry>(
        "ServiceBus.ProcessOrder");

    operation.Telemetry.Properties["MessageId"]  = message.MessageId;
    operation.Telemetry.Properties["QueueName"]  = "orders";

    try
    {
        var order = JsonSerializer.Deserialize<OrderCreatedEvent>(message.Body);
        await ProcessOrderInternalAsync(order!);

        operation.Telemetry.Success    = true;
        operation.Telemetry.ResponseCode = "200";
    }
    catch (Exception ex)
    {
        operation.Telemetry.Success    = false;
        operation.Telemetry.ResponseCode = "500";
        telemetry.TrackException(ex);
        throw;
    }
}
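
The same grouping idea can be sketched with the vendor-neutral `System.Diagnostics.ActivitySource` API that OpenTelemetry builds on. This is an illustration, not the App Insights implementation: the source name `Orders.Worker`, the child step names, and the class `OperationGroupingDemo` are hypothetical, and the `ActivityListener` stands in for an OpenTelemetry pipeline that would normally subscribe through `AddSource`.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class OperationGroupingDemo
{
    // Hypothetical source name for a queue-processing worker.
    private static readonly ActivitySource Source = new("Orders.Worker");

    // Runs a parent operation with two child steps and returns every trace ID observed.
    public static HashSet<string> ProcessMessage()
    {
        var traceIds = new HashSet<string>();
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == "Orders.Worker",
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllData,
            ActivityStopped = activity => traceIds.Add(activity.TraceId.ToString()),
        };
        ActivitySource.AddActivityListener(listener);

        using (var parent = Source.StartActivity("ServiceBus.ProcessOrder", ActivityKind.Consumer))
        {
            parent?.SetTag("QueueName", "orders");

            // Children started while the parent is current inherit its trace ID.
            using (Source.StartActivity("ValidateOrder")) { }
            using (Source.StartActivity("SaveOrder")) { }
        }
        return traceIds;
    }
}
```

All three activities report the same trace ID, which plays the role of the shared operation ID in the `StartOperation` example above.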

02c Sampling & Cost Control

Application Insights charges per GB ingested. For high-traffic services, sampling is essential: it reduces data volume while preserving statistical accuracy and keeping correlated telemetry together (all items of one operation are either all retained or all dropped).

| Sampling Type | Where Configured | How It Works | Best For |
|---|---|---|---|
| Adaptive Sampling | SDK (default on) | Auto-adjusts rate to stay under target events/sec | Variable traffic; production default |
| Fixed-Rate Sampling | SDK config | Fixed % of operations sampled, giving predictable volume | Predictable billing, A/B comparison |
| Ingestion Sampling | App Insights portal | Drops data after arrival; no SDK change needed | Quick cost reduction without redeployment |
| OpenTelemetry Sampling | OTel SDK | Head-based or tail-based at trace level | Modern OTel pipelines |

csharp
// Adaptive sampling: recommended for most services.
// EnableAdaptiveSampling = true uses the SDK defaults. To tune the settings,
// turn the built-in processor off and add your own, so sampling is not applied twice.
services.AddApplicationInsightsTelemetry(options =>
{
    options.EnableAdaptiveSampling = false;
});

services.Configure<TelemetryConfiguration>(config =>
{
    var settings = new SamplingPercentageEstimatorSettings
    {
        MaxTelemetryItemsPerSecond = 5,     // Target: max 5 events/sec per instance
        MinSamplingPercentage      = 0.1,   // Never sample less than 0.1%
        MaxSamplingPercentage      = 100,   // Full sampling when low traffic
        EvaluationInterval         = TimeSpan.FromSeconds(15),
        SamplingPercentageDecreaseTimeout = TimeSpan.FromMinutes(2),
        SamplingPercentageIncreaseTimeout = TimeSpan.FromMinutes(15),
    };

    // Older SDK versions expose config.TelemetryProcessorChainBuilder directly.
    var chain = config.DefaultTelemetrySink.TelemetryProcessorChainBuilder;
    chain.UseAdaptiveSampling(settings, null); // null = no "percentage evaluated" callback
    chain.Build();
});

// ── Fixed-rate sampling β€” predictable volume ──────────────────────────
services.Configure<TelemetryConfiguration>(config =>
{
    config.TelemetryProcessorChainBuilder
        .UseSampling(samplingPercentage: 10) // Sample 10% of operations
        .Build();
});

// ── Exclude specific telemetry from sampling ──────────────────────────
public class ExcludeHealthCheckFilter : ITelemetryProcessor
{
    private readonly ITelemetryProcessor _next;

    public ExcludeHealthCheckFilter(ITelemetryProcessor next) => _next = next;

    public void Process(ITelemetry item)
    {
        // Don't send health check telemetry to App Insights at all
        if (item is RequestTelemetry req &&
            req.Url?.AbsolutePath.StartsWith("/health") == true)
            return;

        _next.Process(item);
    }
}

services.Configure<TelemetryConfiguration>(config =>
{
    config.TelemetryProcessorChainBuilder
        .Use(next => new ExcludeHealthCheckFilter(next))
        .UseSampling(10)
        .Build();
});
💰
Sampling preserves trace integrity
When a request is sampled, App Insights samples all dependent telemetry with the same operation ID, so the complete trace stays together or is dropped together. As long as the services involved sample at a consistent rate, you do not see a request without its dependencies, or a dependency without its parent request.
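
The reason correlated items stay together is that the keep-or-drop decision is derived from the operation ID rather than rolled per item. The sketch below illustrates that idea with a simple FNV-1a hash; it is NOT the SDK's actual algorithm, and `SamplingDemo`, `Score`, `IsSampledIn`, and `EstimateTotal` are hypothetical names. It also shows how a retained count is scaled back up, which is the role the `itemCount` field plays in App Insights queries.

```csharp
using System;

public static class SamplingDemo
{
    // Illustrative hash, not the SDK's algorithm: maps an operation ID
    // to a stable score in the range [0, 100).
    public static double Score(string operationId)
    {
        unchecked
        {
            uint hash = 2166136261;        // FNV-1a offset basis
            foreach (char c in operationId)
            {
                hash ^= c;
                hash *= 16777619;          // FNV-1a prime
            }
            return hash % 10000 / 100.0;
        }
    }

    // Every telemetry item carrying this operation ID gets the same answer.
    public static bool IsSampledIn(string operationId, double samplingPercentage) =>
        Score(operationId) < samplingPercentage;

    // Each retained item stands in for 100 / percentage original items.
    public static double EstimateTotal(long retainedItems, double samplingPercentage) =>
        retainedItems * (100.0 / samplingPercentage);
}
```

Because the score depends only on the operation ID, a request and its dependencies always land on the same side of the threshold, and 50 retained items at 10% sampling are reported as an estimated 500.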

02d Availability Tests

Availability tests send synthetic requests to your endpoints on a schedule from Azure points of presence around the world, detecting outages from the user's perspective, often before internal monitoring notices them.

| Test Type | Description | Best For |
|---|---|---|
| Standard (URL ping) | Simple HTTP GET/POST to one URL; checks status code and response time | API health endpoints, uptime SLA monitoring |
| Multi-step (TrackAvailability) | Custom code simulating a user journey: login, search, checkout | Critical user flows, end-to-end smoke tests |
| Custom TrackAvailability | Emit availability telemetry from your own infrastructure | Private endpoints not reachable from Azure PoPs |

csharp
// Custom availability test via Azure Function (for private endpoints)
[FunctionName("AvailabilityTest")]
public async Task Run([TimerTrigger("0 */5 * * * *")] TimerInfo timer)
{
    var testName    = "Orders API - Create Order Flow";
    var runLocation = "East US";
    var startTime   = DateTimeOffset.UtcNow;
    var timer2      = Stopwatch.StartNew();
    bool success    = false;
    string message  = "";

    try
    {
        // Step 1: Authenticate
        var token = await GetTestTokenAsync();

        // Step 2: Create a test order
        _httpClient.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", token);

        var response = await _httpClient.PostAsJsonAsync(
            "/api/v1/orders",
            new { customerId = TestCustomerId, lineItems = TestLineItems });

        success = response.IsSuccessStatusCode;
        message = $"Status: {(int)response.StatusCode}";

        // Step 3: Verify the order exists
        if (success)
        {
            var order = await response.Content.ReadFromJsonAsync<OrderResponse>();
            var getResponse = await _httpClient.GetAsync($"/api/v1/orders/{order!.OrderId}");
            success = getResponse.IsSuccessStatusCode;
            message += $" | GET: {(int)getResponse.StatusCode}";
        }
    }
    catch (Exception ex)
    {
        success = false;
        message = ex.Message;
    }
    finally
    {
        timer2.Stop();
        _telemetry.TrackAvailability(
            name:      testName,
            timeStamp: startTime,
            duration:  timer2.Elapsed,
            runLocation: runLocation,
            success:   success,
            message:   message);
    }
}

02e Application Map & Live Metrics

Application Map automatically renders your service topology: nodes for each component, edges for dependency calls, and failure rates and latency on each edge. Live Metrics streams real-time telemetry with sub-second latency, which is essential during deployments and incident response.

kql
// ── App Insights KQL β€” Failure rate by operation in last 1h ─────────
requests
| where timestamp > ago(1h)
| summarize
    Total    = count(),
    Failed   = countif(success == false),
    P95_ms   = percentile(duration, 95),
    P99_ms   = percentile(duration, 99)
  by name
| extend FailureRate = round(100.0 * Failed / Total, 2)
| where Total > 10
| order by FailureRate desc

// ── Slow dependency calls (SQL > 1s) ──────────────────────────────────
dependencies
| where timestamp > ago(1h)
| where type == "SQL"
| where duration > 1000
| project timestamp, name, data, duration, success, resultCode
| order by duration desc
| take 50

// ── Top exceptions in last 24h ────────────────────────────────────────
exceptions
| where timestamp > ago(24h)
| summarize Count = count() by type, outerMessage
| order by Count desc
| take 20

// ── User journey β€” trace one operation end-to-end ─────────────────────
let opId = "abc123def456";
union requests, dependencies, exceptions, traces
| where operation_Id == opId
| project timestamp, itemType, name, duration, success, message, type
| order by timestamp asc

03 Log Analytics

Log Analytics Workspace is Azure's centralised log aggregation and query engine. Every Azure service can ship its diagnostic logs here. Your applications send structured logs via the App Insights SDK or the OTel OTLP exporter. Queries are written in KQL (Kusto Query Language), an expressive, pipe-based query language purpose-built for log analytics.

| Log Source | How to Connect | Key Tables |
|---|---|---|
| Azure App Service | Diagnostic Settings → Log Analytics | AppServiceHTTPLogs, AppServiceConsoleLogs |
| Azure Functions | Diagnostic Settings + AI SDK | FunctionAppLogs, requests, traces |
| Azure Kubernetes Service | Container Insights add-on | ContainerLog, KubePodInventory, KubeEvents |
| Azure API Management | Diagnostic Settings | ApiManagementGatewayLogs |
| Azure Service Bus | Diagnostic Settings | AzureDiagnostics (ResourceType=NAMESPACES) |
| Azure Key Vault | Diagnostic Settings | AzureDiagnostics (ResourceType=VAULTS) |
| Application Insights | Linked workspace (auto) | requests, dependencies, exceptions, traces |
| Custom App | OTel OTLP exporter / DCR API | Custom table or AppTraces |
| Azure Activity Log | Export to workspace | AzureActivity |
| VM / Arc | Azure Monitor Agent | Syslog, SecurityEvent, Event |

03a Workspace Design

How you structure your Log Analytics workspaces determines your query capabilities, access control boundaries, and cost efficiency. The recommended pattern for most organizations is a centralized workspace per environment with table-level RBAC, which enables cross-service correlation queries while maintaining security isolation. Always link Application Insights to a workspace so you can join APM data with platform logs in a single KQL query. Consider commitment tiers once you exceed 100 GB/day ingestion, as they offer up to 30% savings over pay-as-you-go pricing.

| Decision | Recommendation | Reason |
|---|---|---|
| Workspaces per environment | One per environment (dev/staging/prod) | Isolation: no dev noise in prod queries |
| Workspaces per region | One per region if data residency required | Compliance: data stays in region |
| Centralised vs per-team | Central workspace + table-level RBAC | Cost efficiency + cross-service correlation |
| Retention period | 30 days hot, up to 730 days archive | Balance cost vs forensics requirement |
| Commitment tier | Use commitment tier at >100 GB/day | Up to 30% discount over pay-as-you-go |
| Linked App Insights | Always link AI to a workspace | Unified queries across AI + platform logs |

bash
# Create Log Analytics Workspace
az monitor log-analytics workspace create \
  --resource-group myRG \
  --workspace-name myPlatformWorkspace \
  --location eastus \
  --retention-time 90 \
  --sku PerGB2018

# Get workspace ID (for App Insights linking)
WORKSPACE_ID=$(az monitor log-analytics workspace show \
  --resource-group myRG \
  --workspace-name myPlatformWorkspace \
  --query id --output tsv)

# Create Application Insights linked to the workspace
az monitor app-insights component create \
  --app myOrdersApi \
  --resource-group myRG \
  --location eastus \
  --workspace "$WORKSPACE_ID" \
  --kind web

# Enable diagnostic settings for Service Bus → Workspace
az monitor diagnostic-settings create \
  --resource "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ServiceBus/namespaces/<ns>" \
  --name "sb-diagnostics" \
  --workspace "$WORKSPACE_ID" \
  --logs '[{"category":"OperationalLogs","enabled":true},{"category":"VNetAndIPFilteringLogs","enabled":true}]' \
  --metrics '[{"category":"AllMetrics","enabled":true}]'

03b KQL: Core Queries

KQL (Kusto Query Language) is the query language for Log Analytics and Application Insights, purpose-built for exploring large volumes of telemetry interactively. Unlike SQL, KQL uses a pipe-based syntax where each operator transforms the result set flowing through it, making complex queries readable and composable. Always start with a time filter to limit the data scanned, use has instead of contains for better performance on whole-term matches, and use summarize with bin() for time-series aggregations that render directly as charts.

KQL Cheat Sheet: Essential Operators

kql
// ── Filtering ─────────────────────────────────────────────────────────
TableName
| where TimeGenerated > ago(1h)                   // Time filter (always first!)
| where Level == "Error"                           // Exact match
| where Message contains "timeout"                // Case-insensitive substring
| where Message has "OrderId"                      // Faster than contains for words
| where StatusCode between (500 .. 599)            // Numeric range
| where isnotempty(CorrelationId)                  // Not null/empty
| where tostring(Properties.region) in ("EU", "US") // Value in list (cast dynamic first)

// ── Projection ────────────────────────────────────────────────────────
| project TimeGenerated, Level, Message, Properties  // Select columns
| project-away TenantId, SubscriptionId              // Drop columns
| extend DurationSec = Duration / 1000.0             // Add computed column
| parse Message with "OrderId=" OrderId " amount=" Amount:double " region=" Region

// ── Aggregation ───────────────────────────────────────────────────────
| summarize
    Count       = count(),
    ErrorCount  = countif(Level == "Error"),
    AvgDuration = avg(Duration),
    P95Duration = percentile(Duration, 95),
    MaxDuration = max(Duration)
  by bin(TimeGenerated, 5m), ServiceName

// ── Sorting & Limiting ────────────────────────────────────────────────
| order by TimeGenerated desc
| top 100 by Duration desc        // Top N by a column

// ── Joining tables ────────────────────────────────────────────────────
requests
| join kind=leftouter (
    exceptions
    | project operation_Id, exceptionType = type, exceptionMsg = outerMessage
) on operation_Id

// ── String operations ─────────────────────────────────────────────────
| extend OrderId = extract("OrderId=([a-f0-9-]+)", 1, Message)
| extend Domain  = split(Email, "@")[1]
| where ServiceName startswith "order"
| where Url matches regex @"/api/v[0-9]+/orders/[a-f0-9-]+"

// ── Time operations ───────────────────────────────────────────────────
| extend HourOfDay  = hourofday(TimeGenerated)
| extend DayOfWeek  = dayofweek(TimeGenerated)
| where TimeGenerated between (datetime(2026-05-01) .. datetime(2026-05-08))

Essential Production Queries

kql
// ── Error rate trend (5-min buckets) ─────────────────────────────────
requests
| where TimeGenerated > ago(3h)
| summarize
    Total  = count(),
    Failed = countif(success == false)
  by bin(TimeGenerated, 5m), cloud_RoleName
| extend ErrorRate = round(100.0 * Failed / Total, 2)
| render timechart

// ── P50/P95/P99 latency by endpoint ──────────────────────────────────
requests
| where TimeGenerated > ago(1h)
| summarize
    P50 = percentile(duration, 50),
    P95 = percentile(duration, 95),
    P99 = percentile(duration, 99),
    Count = count()
  by name
| where Count > 100
| order by P99 desc

// ── Dependency failures β€” which external calls are breaking? ──────────
dependencies
| where TimeGenerated > ago(1h)
| where success == false
| summarize
    FailureCount = count(),
    AvgDuration  = avg(duration),
    Targets      = make_set(target)
  by type, name
| order by FailureCount desc

// ── Exceptions β€” top 10 with sample messages ─────────────────────────
exceptions
| where TimeGenerated > ago(24h)
| summarize
    Count   = count(),
    Sample  = any(outerMessage),
    LastSeen = max(TimeGenerated)
  by type
| top 10 by Count desc

// ── Slow SQL queries ──────────────────────────────────────────────────
dependencies
| where TimeGenerated > ago(1h)
| where type == "SQL"
| where duration > 500
| project TimeGenerated, name, data, duration, resultCode
| order by duration desc
| take 20

// ── Container restart loop detection ─────────────────────────────────
ContainerLog
| where TimeGenerated > ago(1h)
| where LogEntry contains "OOMKilled" or LogEntry contains "CrashLoopBackOff"
| summarize Restarts = count() by ContainerID, _ResourceId
| where Restarts > 3
| order by Restarts desc

03c KQL: Advanced Patterns

Advanced KQL patterns unlock capabilities like anomaly detection, cross-workspace correlation, funnel analysis, and dynamic JSON parsing; these are the queries that power production alert rules and executive dashboards. Error spike detection compares current error counts against a rolling baseline to catch regressions without hardcoded thresholds. Cross-workspace queries let you correlate security events with application failures, while funnel analysis tracks user drop-off through multi-step business flows. Master these patterns and you can answer almost any operational or business question from your telemetry data alone.

kql
// ── Alerting query: error spike detection (> 2x baseline) ───────────
let baseline = requests
    | where TimeGenerated between (ago(2h) .. ago(1h))
    | summarize BaselineErrors = countif(success == false);
let current = requests
    | where TimeGenerated > ago(1h)
    | summarize CurrentErrors = countif(success == false);
current
| cross join baseline
| where CurrentErrors > BaselineErrors * 2 and CurrentErrors > 10
| project CurrentErrors, BaselineErrors,
          Increase = round(100.0 * (CurrentErrors - BaselineErrors) / BaselineErrors, 1)

// ── User impact analysis β€” how many users hit errors? ─────────────────
requests
| where TimeGenerated > ago(1h)
| where success == false
| summarize
    FailedRequests = count(),
    AffectedUsers  = dcount(user_Id),
    AffectedOps    = dcount(operation_Id)
  by name, resultCode
| order by AffectedUsers desc

// ── Funnel analysis β€” order creation drop-off ─────────────────────────
let step1 = customEvents | where name == "CartViewed"     | summarize s1 = dcount(user_Id);
let step2 = customEvents | where name == "CheckoutStarted" | summarize s2 = dcount(user_Id);
let step3 = customEvents | where name == "OrderSubmitted"  | summarize s3 = dcount(user_Id);
let step4 = customEvents | where name == "OrderConfirmed"  | summarize s4 = dcount(user_Id);
step1 | cross join step2 | cross join step3 | cross join step4
| project
    CartViewed        = s1,
    CheckoutStarted   = s2,
    OrderSubmitted    = s3,
    OrderConfirmed    = s4,
    CheckoutConversion = round(100.0 * s2 / s1, 1),
    SubmitConversion   = round(100.0 * s3 / s2, 1),
    FinalConversion    = round(100.0 * s4 / s1, 1)

// ── Cross-workspace query β€” correlate App Insights + Security logs ────
workspace("SecurityWorkspace").SecurityEvent
| where TimeGenerated > ago(1h)
| where EventID == 4625 // Failed login
| join kind=inner (
    app("OrdersAppInsights").requests
    | where success == false
    | project operation_Id, clientIP = client_IP
) on $left.IpAddress == $right.clientIP
| project TimeGenerated, IpAddress, Account, operation_Id

// ── Dynamic columns from JSON properties ──────────────────────────────
customEvents
| where name == "OrderCreated"
| extend
    OrderId  = tostring(customDimensions["OrderId"]),
    Amount   = todouble(customDimensions["OrderAmount"]),
    Region   = tostring(customDimensions["Region"])
| summarize Revenue = sum(Amount), Orders = count() by Region, bin(TimeGenerated, 1h)
| render timechart
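
The error spike rule described in this section (current errors more than twice the baseline, with a minimum floor) is easy to get subtly wrong when the baseline is zero. The following sketch restates the rule in C# so the edge cases can be unit-tested; `ErrorSpikeRule` and its parameter names are hypothetical, and the default multiplier of 2 and floor of 10 are taken from the alerting query.

```csharp
using System;

public static class ErrorSpikeRule
{
    // Fires when current-window errors exceed both a floor and a multiple of the baseline.
    public static bool IsSpike(long currentErrors, long baselineErrors,
                               double multiplier = 2.0, long minimumErrors = 10) =>
        currentErrors > minimumErrors && currentErrors > baselineErrors * multiplier;

    // Returns null when the baseline is zero, since a percentage increase is undefined there.
    public static double? IncreasePercent(long currentErrors, long baselineErrors) =>
        baselineErrors == 0
            ? null
            : Math.Round(100.0 * (currentErrors - baselineErrors) / baselineErrors, 1);
}
```

With a baseline of 10 errors, 25 current errors counts as a spike (a 150% increase) while 15 does not; the floor keeps a jump from 0 to 8 errors from paging anyone.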

03d Key Log Tables Reference

Every Azure service and Application Insights telemetry type maps to a specific table in Log Analytics, and knowing which table to query is half the battle during incident response. The requests and dependencies tables are your go-to for API performance, while AzureDiagnostics is the catch-all for platform resource logs filtered by ResourceType. Bookmark this reference so you can jump straight to the right table when debugging production issues instead of guessing table names under pressure.

| Table | Source | Key Columns |
|---|---|---|
| requests | App Insights | name, duration, success, resultCode, url, operation_Id, cloud_RoleName |
| dependencies | App Insights | name, type, target, data, duration, success, resultCode, operation_Id |
| exceptions | App Insights | type, outerMessage, innermostMessage, stack, operation_Id, cloud_RoleName |
| traces | App Insights | message, severityLevel, customDimensions, operation_Id, cloud_RoleName |
| customEvents | App Insights | name, customDimensions, customMeasurements, operation_Id, user_Id |
| customMetrics | App Insights | name, value, valueCount, valueSum, valueMin, valueMax |
| availabilityResults | App Insights | name, success, duration, location, message, runLocation |
| AzureDiagnostics | Azure Resources | ResourceType, OperationName, ResultType, Level, CallerIpAddress |
| AzureActivity | Azure RBAC / Control | OperationName, Caller, ResourceGroup, ActivityStatus, Level |
| ContainerLog | AKS / Container Insights | ContainerID, LogEntry, LogEntrySource, _ResourceId |
| KubePodInventory | AKS | PodName, Namespace, PodStatus, ContainerStatusReason, Node |
| KubeEvents | AKS | Name, Namespace, Reason, Message, KubeEventType |
| AppServiceHTTPLogs | App Service | CsMethod, CsUriStem, ScStatus, TimeTaken, CIp |
| FunctionAppLogs | Azure Functions | HostInstanceId, Message, ExceptionMessage, FunctionName, Level |
| ApiManagementGatewayLogs | APIM | ApiId, OperationId, ResponseCode, TotalTime, ClientIp |
| SigninLogs | Entra ID | UserPrincipalName, IPAddress, ResultType, ConditionalAccessStatus |

04 Azure Monitor

Azure Monitor is the platform-level observability service: it automatically collects metrics from every Azure resource (no agent needed), evaluates alert rules, triggers action groups, and aggregates data into workbooks and dashboards. It is the umbrella that App Insights and Log Analytics sit within.

| Azure Monitor Feature | Description |
|---|---|
| Platform Metrics | Auto-collected numeric metrics from every Azure resource: CPU, requests, queue depth, etc. |
| Custom Metrics | Emit your own time-series from apps via OTel or App Insights SDK |
| Alert Rules | Evaluate metric/log conditions and fire on threshold breach |
| Action Groups | Notifications and automation triggered by alerts: email, SMS, webhook, PagerDuty, Logic App |
| Workbooks | Interactive parameterised reports mixing KQL, metrics, and markdown |
| Dashboards | Pinnable metric charts and tiles for NOC-style displays |
| Autoscale | Scale Azure resources (App Service, VMSS) based on metric thresholds |
| Change Analysis | Track infrastructure changes correlated with incidents |
| Service Health | Azure platform incidents affecting your specific resources |

04a Metrics & Dimensions

Azure Monitor automatically collects platform metrics from every resource (CPU, memory, request counts, queue depths) with no agent or SDK required. These numeric time-series are stored for 93 days and are typically available at one-minute granularity, making them well suited to near-real-time dashboards and alert rules. For application-specific measurements, you can emit custom metrics via the OpenTelemetry Meter API using counters, histograms, and gauges with dimensional tags that enable filtering and grouping in metrics explorer.

Key Metrics Per Service

| Service | Critical Metrics to Monitor |
|---|---|
| App Service / Functions | CpuPercentage, MemoryWorkingSet, HttpResponseTime, HttpServerErrors, Requests |
| Azure SQL / SQL MI | cpu_percent, dtu_consumption_percent, deadlock, connection_failed, storage_percent |
| Azure Service Bus | ActiveMessages, DeadletteredMessages, IncomingMessages, ThrottledRequests, ServerErrors |
| Azure Event Hubs | IncomingMessages, OutgoingMessages, ThrottledRequests, IncomingBytes, CapturedMessages |
| Azure Event Grid | PublishSuccessCount, DeliverySuccessCount, DeadLetteredCount, DroppedEventCount |
| Azure Key Vault | ServiceApiHit, ServiceApiLatency, ServiceApiResult (failures) |
| Azure API Management | TotalRequests, SuccessfulRequests, FailedRequests, Duration, Capacity |
| Azure Container Apps | Replicas, CpuUsage, MemoryUsage, RequestCount, ResponseTime |
| AKS | node_cpu_usage_percentage, node_memory_working_set_percentage, kube_pod_status_phase |
| Azure Cosmos DB | TotalRequests, TotalRequestUnits, ServerSideLatency, NormalizedRUConsumption |

Custom Metrics via OpenTelemetry (.NET 8)

csharp
using System.Diagnostics;          // Stopwatch
using System.Diagnostics.Metrics;

// Define meters and instruments (static, created once)
public static class OrdersMetrics
{
    private static readonly Meter Meter = new("Orders.Api", "1.0.0");

    // Counter: monotonically increasing
    public static readonly Counter<long> OrdersCreated =
        Meter.CreateCounter<long>(
            name: "orders.created.total",
            unit: "orders",
            description: "Total number of orders created");

    // Histogram: distribution of values (latency, sizes)
    public static readonly Histogram<double> OrderProcessingDuration =
        Meter.CreateHistogram<double>(
            name: "orders.processing.duration",
            unit: "ms",
            description: "Time to process an order end-to-end");

    // ObservableGauge: current value read from a callback
    public static readonly ObservableGauge<int> PendingOrders =
        Meter.CreateObservableGauge<int>(
            name: "orders.pending.count",
            observeValue: () => OrderRepository.GetPendingCount(),
            unit: "orders",
            description: "Current number of pending orders");

    // UpDownCounter: can increase and decrease
    public static readonly UpDownCounter<int> ActiveConnections =
        Meter.CreateUpDownCounter<int>(
            name: "orders.active.connections",
            unit: "connections");
}

// Usage in command handler
public async Task<Result<OrderResponse>> Handle(CreateOrderCommand cmd, CancellationToken ct)
{
    var sw = Stopwatch.StartNew();
    try
    {
        var order = Order.Create(cmd.CustomerId, cmd.LineItems);
        await _repository.SaveAsync(order, ct);

        // Record with dimensions (tags)
        OrdersMetrics.OrdersCreated.Add(1,
            new KeyValuePair<string, object?>("region",   cmd.Region),
            new KeyValuePair<string, object?>("channel",  cmd.Channel),
            new KeyValuePair<string, object?>("priority", cmd.IsPriority ? "high" : "normal"));

        return Result.Success(OrderResponse.From(order));
    }
    finally
    {
        sw.Stop();
        OrdersMetrics.OrderProcessingDuration.Record(sw.Elapsed.TotalMilliseconds,
            new KeyValuePair<string, object?>("region", cmd.Region));
    }
}

// Register meters with OTel
builder.Services.AddOpenTelemetry()
    .WithMetrics(m => m
        .AddMeter("Orders.Api")
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()           // GC, thread pool, heap
        .AddProcessInstrumentation()           // CPU, memory
        .AddOtlpExporter()                     // to an OTLP endpoint, e.g. an OTel Collector
        .AddPrometheusExporter());             // also call app.MapPrometheusScrapingEndpoint() to serve /metrics
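
The dimensional tags used above (region, channel, priority) are what make a single counter filterable and groupable. The following is a minimal sketch of that idea, written in Python purely as compact runnable pseudocode and not as part of the .NET stack. `DimensionalCounter` and its methods are hypothetical names invented for this illustration; a real metrics backend stores far more than running totals.

```python
from collections import Counter


class DimensionalCounter:
    """Toy counter that keeps one running total per unique tag combination."""

    def __init__(self) -> None:
        self._series: Counter = Counter()

    def add(self, value: int, **tags: str) -> None:
        # Sort the tags so the same combination always maps to the same series
        key = tuple(sorted(tags.items()))
        self._series[key] += value

    def series_count(self) -> int:
        """Number of distinct time series (the metric's cardinality)."""
        return len(self._series)

    def total_by(self, dimension: str) -> dict:
        """Group totals by one dimension, summing across all the others."""
        totals: Counter = Counter()
        for key, value in self._series.items():
            totals[dict(key).get(dimension, "<unset>")] += value
        return dict(totals)


# Hypothetical measurements mirroring the tags in the handler above
orders_created = DimensionalCounter()
orders_created.add(1, region="eu", channel="web", priority="normal")
orders_created.add(1, region="eu", channel="mobile", priority="high")
orders_created.add(2, region="us", channel="web", priority="normal")
orders_created.add(1, region="eu", channel="web", priority="normal")

print(orders_created.series_count())
print(orders_created.total_by("region"))
print(orders_created.total_by("channel"))
```

Every distinct combination of tag values becomes its own series, which is why unbounded values such as order IDs or user IDs are a poor choice for metric dimensions.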

04b Alert Rules & Action Groups

Alert rules are the automated sentinels of your production environment: they continuously evaluate conditions against your metrics and logs, firing notifications when thresholds are breached. Action groups define who gets notified and how: email, SMS, webhook to PagerDuty, or even triggering a Logic App for automated remediation. The key to effective alerting is combining metric alerts for real-time threshold detection with log search alerts for complex pattern matching, while keeping severity levels aligned with actual user impact to prevent alert fatigue.

| Alert Type | Condition Evaluated | Best For |
|---|---|---|
| Metric Alert | Numeric metric crosses threshold | CPU > 80%, error rate > 5%, queue depth > 1000 |
| Log Search Alert | KQL query returns rows | DLQ messages, specific error patterns, security events |
| Activity Log Alert | Azure control-plane event occurs | Resource deleted, role assigned, policy violated |
| Smart Detection | AI-detected anomalies in App Insights | Sudden increase in exceptions, degraded response time |
| Resource Health Alert | Azure resource enters unhealthy state | VM unavailable, SQL inaccessible, App Service down |

Create Alert Rule via CLI

bash
# ── Step 1: Create Action Group (who gets notified) ─────────────────
az monitor action-group create \
  --resource-group myRG \
  --name platform-oncall \
  --short-name oncall \
  --action email oncall-email ops-team@contoso.com \
  --action webhook pagerduty https://events.pagerduty.com/integration/<key>/enqueue

# ── Step 2: Metric Alert: Service Bus DLQ > 0 ──────────────────────
az monitor metrics alert create \
  --name "ServiceBus-DLQ-Alert" \
  --resource-group myRG \
  --scopes "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ServiceBus/namespaces/<ns>" \
  --condition "avg DeadletteredMessages > 0" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 2 \
  --action "/subscriptions/<sub>/resourceGroups/myRG/providers/microsoft.insights/actionGroups/platform-oncall" \
  --description "Messages in Service Bus dead-letter queue"

# ── Step 3: Log Alert: request failure-rate spike ────────────────────
# Requires the scheduled-query CLI extension. --condition refers to the query
# through a placeholder name (FailRate5m is an arbitrary label), and
# --window-size supplies the time range the query runs over.
az monitor scheduled-query create \
  --name "High-Error-Rate-Alert" \
  --resource-group myRG \
  --scopes "<app-insights-resource-id>" \
  --condition "count 'FailRate5m' > 0" \
  --condition-query FailRate5m="requests | summarize FailRate = countif(success==false)*100.0/count() | where FailRate > 5" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --severity 1 \
  --action-groups "/subscriptions/<sub>/resourceGroups/myRG/providers/microsoft.insights/actionGroups/platform-oncall"

Production Alert Rules Starter Pack

| Alert Name | Metric / Query | Threshold | Severity |
|---|---|---|---|
| High Error Rate | requests: failure rate | > 5% over 5 min | Sev 1 |
| P99 Latency Spike | requests: P99 duration | > 3000 ms | Sev 2 |
| DLQ Messages | DeadletteredMessages | > 0 | Sev 2 |
| CPU Sustained High | CpuPercentage | > 85% for 15 min | Sev 2 |
| Memory Near Limit | MemoryWorkingSet | > 90% of limit | Sev 2 |
| Availability Failed | availabilityResults: success==false | Any failure | Sev 1 |
| Key Vault Unauthorized | AzureDiagnostics: ResultType!=Success | Any | Sev 1 |
| Pod CrashLoopBackOff | KubeEvents: Reason==BackOff | Any | Sev 2 |
| Storage Near Capacity | UsedCapacity | > 80% of limit | Sev 3 |
| Exception Spike | exceptions: count vs 1h baseline | > 2x baseline | Sev 2 |

04c Workbooks & Dashboards

Workbooks are interactive, parameterised reports that combine KQL queries, Azure Metrics, markdown text, and ARM data into a single document. They are ideal for SLO reports, incident postmortems, capacity planning, and team-facing health dashboards.

json
// Workbook ARM template structure (simplified)
{
  "type": "microsoft.insights/workbooks",
  "properties": {
    "displayName": "Orders API: SLO Dashboard",
    "serializedData": {
      "version": "Notebook/1.0",
      "items": [
        {
          "type": 9,          // Parameters
          "content": {
            "parameters": [
              {
                "id": "timeRange",
                "version": "KqlParameterItem/1.0",
                "name": "TimeRange",
                "type": 4,
                "value": { "durationMs": 3600000 }
              },
              {
                "id": "service",
                "name": "ServiceName",
                "type": 2,             // Drop-down
                "query": "requests | summarize by cloud_RoleName"
              }
            ]
          }
        },
        {
          "type": 3,          // Query item (chart)
          "content": {
            "query": "requests | where cloud_RoleName == '{ServiceName}' | summarize ErrorRate = countif(success==false)*100.0/count() by bin(timestamp, 5m) | render timechart",
            "visualization": "timechart",
            "title": "Error Rate %: {ServiceName}"
          }
        }
      ]
    }
  }
}

04d Diagnostic Settings

Diagnostic settings control where an Azure resource sends its platform logs and metrics. Every production resource should have diagnostic settings routing to a Log Analytics workspace.

bicep
// Bicep: apply diagnostic settings to a resource via a module
module diagnostics 'modules/diagnostics.bicep' = {
  name: 'orders-api-diagnostics'
  params: {
    appServiceName: ordersAppService.name
    workspaceId: logAnalyticsWorkspace.id
  }
}

// modules/diagnostics.bicep
// The scope of an extension resource must be a resource symbol, not a string ID,
// so the module looks up the target App Service by name.
param appServiceName string
param workspaceId string

resource appService 'Microsoft.Web/sites@2022-09-01' existing = {
  name: appServiceName
}

resource diagnosticSetting 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'send-to-workspace'
  scope: appService
  properties: {
    workspaceId: workspaceId
    // Retention for workspace-routed data is configured on the workspace itself;
    // the per-category retentionPolicy applied only to storage account destinations.
    logs: [
      { categoryGroup: 'allLogs', enabled: true }
    ]
    metrics: [
      { category: 'AllMetrics', enabled: true }
    ]
  }
}
⚠️
Enable Diagnostic Settings at Deployment Time
Diagnostic settings are not enabled by default. Missing settings mean missing logs during an incident. Enforce them with Azure Policy (Deploy-Diagnostics initiative) so every new resource automatically ships logs to your workspace.

05 Distributed Tracing

Distributed tracing answers the question: "What happened to this request as it moved across my 12 microservices?" Each service creates a span (a named, timed operation). Spans are linked by a shared trace-id propagated in HTTP headers and message properties. The resulting tree of spans is a trace, showing the full causal chain, timing, and errors.

| Concept | Definition |
|---|---|
| Trace | The entire journey of one request across all services, represented as a tree of spans |
| Span | A single named, timed operation within one service (HTTP handler, DB query, cache lookup) |
| TraceId | 128-bit ID unique to one trace, shared by all spans in that trace |
| SpanId | 64-bit ID unique to one span, used as parent reference by child spans |
| ParentSpanId | The SpanId of the span that created this one; it builds the tree structure |
| Baggage | Key/value pairs propagated through the entire trace, for cross-cutting context |
| W3C TraceContext | HTTP header standard: traceparent + tracestate, the propagation format |
| OTLP | OpenTelemetry Protocol: standard wire format for exporting traces, metrics, logs |
| Sampling | Decision to record or discard a trace, either head-based (at trace start) or tail-based |
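
To make the TraceId / SpanId / ParentSpanId relationship concrete, here is a small sketch that rebuilds a trace tree from a flat list of spans, which is roughly what a trace viewer does before drawing a waterfall. It is written in Python as runnable pseudocode rather than against any SDK; the `Span` type, the span IDs, and the service names are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    duration_ms: float


def index_by_parent(spans):
    """Group spans under the ID of the span that created them."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_span_id].append(span)
    return children


def render(children, parent_id=None, depth=0):
    """Depth-first walk that returns one indented line per span."""
    lines = []
    for span in children.get(parent_id, []):
        lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms:g} ms)")
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


# Hypothetical spans that all share one trace ID
TRACE = "4bf92f3577b34da6a3ce929d0e0e4736"
spans = [
    Span(TRACE, "a1", None, "gateway: POST /orders", 120),
    Span(TRACE, "b2", "a1", "orders-api: CreateOrder", 95),
    Span(TRACE, "c3", "b2", "inventory: ValidateInventory", 40),
    Span(TRACE, "d4", "b2", "sql: INSERT orders", 30),
]

tree_lines = render(index_by_parent(spans))
print("\n".join(tree_lines))
```

The spans can arrive in any order and from any service; the parent references alone are enough to recover the causal structure.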

05a OpenTelemetry in .NET 8

OpenTelemetry (OTel) is the vendor-neutral, CNCF-backed standard for instrumentation: it lets you collect traces, metrics, and logs with a single SDK and export to any backend (Azure Monitor, Jaeger, Datadog, Grafana). In .NET 8, OTel is a first-class citizen with native support via System.Diagnostics.Activity for tracing and System.Diagnostics.Metrics for metrics. The setup configures resource metadata (service name, version, environment), adds auto-instrumentation for ASP.NET Core, HttpClient, and EF Core, then exports via OTLP, giving you full observability with minimal code changes.

Full OTel Setup: Tracing + Metrics + Logs

csharp
// NuGet packages needed:
// OpenTelemetry.Extensions.Hosting
// OpenTelemetry.Instrumentation.AspNetCore
// OpenTelemetry.Instrumentation.Http
// OpenTelemetry.Instrumentation.EntityFrameworkCore
// OpenTelemetry.Instrumentation.Runtime
// OpenTelemetry.Exporter.OpenTelemetryProtocol
// Azure.Monitor.OpenTelemetry.AspNetCore

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetry()
    // ── Resource metadata (common tags on all signals) ─────────────
    .ConfigureResource(resource => resource
        .AddService(
            serviceName:    "orders-api",
            serviceVersion: "2.1.0",
            serviceInstanceId: Environment.MachineName)
        .AddAttributes(new Dictionary<string, object>
        {
            ["deployment.environment"] = builder.Environment.EnvironmentName,
            ["service.namespace"]      = "com.myplatform",
            ["k8s.pod.name"]           = Environment.GetEnvironmentVariable("HOSTNAME") ?? "",
        }))

    // ── Tracing ────────────────────────────────────────────────────
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation(options =>
        {
            options.RecordException        = true;
            options.EnrichWithHttpRequest  = (activity, request) =>
            {
                activity.SetTag("http.request.body_size",
                    request.ContentLength ?? 0);
            };
            options.EnrichWithHttpResponse = (activity, response) =>
            {
                activity.SetTag("http.response.body_size",
                    response.ContentLength ?? 0);
            };
            // Don't trace health checks: too noisy
            options.Filter = ctx =>
                !ctx.Request.Path.StartsWithSegments("/health");
        })
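        .AddHttpClientInstrumentation(options =>
        {
            options.RecordException = true;
            // The enrich callback receives the live outgoing request, so only read from it.
            // Removing the Authorization header here would strip it from the real call;
            // header values are not recorded on spans unless you add them yourself.
            options.EnrichWithHttpRequestMessage = (activity, request) =>
            {
                activity.SetTag("peer.service", request.RequestUri?.Host);
            };
        })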
        .AddEntityFrameworkCoreInstrumentation(options =>
        {
            options.SetDbStatementForText        = true;  // Capture SQL
            options.SetDbStatementForStoredProcedure = true;
        })
        .AddSource("Orders.Api")           // Custom ActivitySource
        .AddSource("MassTransit")          // MassTransit instrumentation
        .AddOtlpExporter(options =>
        {
            options.Endpoint = new Uri(
                builder.Configuration["Otel:Endpoint"] ?? "http://localhost:4317");
            options.Protocol = OtlpExportProtocol.Grpc;
        }))

    // ── Metrics ────────────────────────────────────────────────────
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddProcessInstrumentation()
        .AddMeter("Orders.Api")
        .AddOtlpExporter())

    // ── Logging ────────────────────────────────────────────────────
    .WithLogging(logging => logging
        .AddOtlpExporter());

// Also export to Azure Monitor (Application Insights) simultaneously
builder.Services.AddOpenTelemetry().UseAzureMonitor();

Custom ActivitySource: Manual Spans

csharp
// Define ActivitySource (one per service or logical domain)
public static class OrdersTracing
{
    public static readonly ActivitySource Source =
        new("Orders.Api", "2.1.0");
}

// Create manual spans for business operations
public async Task<Result<OrderResponse>> Handle(
    CreateOrderCommand cmd, CancellationToken ct)
{
    // Start a new span as a child of the current ambient activity
    using var activity = OrdersTracing.Source.StartActivity(
        "CreateOrder",
        ActivityKind.Internal);

    // Tag the span with business context
    activity?.SetTag("order.customer_id", cmd.CustomerId.ToString());
    activity?.SetTag("order.region",      cmd.Region);
    activity?.SetTag("order.item_count",  cmd.LineItems.Count);
    activity?.SetTag("order.channel",     cmd.Channel);

    try
    {
        // Nested span for inventory validation, scoped so it ends when the call returns
        // (a method-level "using var" would keep it open until the handler exits)
        bool allAvailable;
        using (var inventorySpan = OrdersTracing.Source.StartActivity(
            "ValidateInventory",
            ActivityKind.Client))
        {
            inventorySpan?.SetTag("inventory.product_count", cmd.LineItems.Count);

            var inventory = await _inventoryClient.ValidateAsync(cmd.LineItems, ct);
            allAvailable = inventory.AllAvailable;
            inventorySpan?.SetTag("inventory.all_available", allAvailable);
        }

        if (!allAvailable)
        {
            activity?.SetStatus(ActivityStatusCode.Error, "Insufficient inventory");
            return Result.Failure<OrderResponse>("Some items are out of stock");
        }

        var order = Order.Create(cmd.CustomerId, cmd.LineItems);
        activity?.SetTag("order.id", order.Id.ToString());

        // Add an event (point-in-time annotation on the span)
        activity?.AddEvent(new ActivityEvent("OrderPersisted", tags:
            new ActivityTagsCollection { ["order.id"] = order.Id.ToString() }));

        activity?.SetStatus(ActivityStatusCode.Ok);
        return Result.Success(OrderResponse.From(order));
    }
    catch (Exception ex)
    {
        // Record exception on span
        activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
        activity?.RecordException(ex);
        throw;
    }
}

05b Context Propagation

For a distributed trace to work, the trace context must travel with the request across every hop: HTTP calls, message queue messages, gRPC calls. The W3C TraceContext standard defines how this works via HTTP headers. OTel handles this automatically for instrumented libraries.

text
// W3C traceparent header format
traceparent: 00-<trace-id>-<parent-span-id>-<flags>

// Example:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
//            ^  ^                                ^                ^
//            version  128-bit trace ID           64-bit span ID   flags (01=sampled)

// W3C tracestate: vendor-specific data
tracestate: az=<app-insights-specific-data>

csharp
// ── HttpClient propagates context automatically with OTel ───────────
// Nothing to do: OTel HttpClient instrumentation injects traceparent

// ── Manual propagation for Service Bus messages ───────────────────────
// Sender: inject current trace context into message properties
public async Task SendOrderEventAsync(OrderCreatedEvent evt)
{
    var message = new ServiceBusMessage(JsonSerializer.SerializeToUtf8Bytes(evt))
    {
        MessageId   = evt.OrderId.ToString(),
        ContentType = "application/json",
    };

    // Inject W3C trace context into message ApplicationProperties
    var propagator = Propagators.DefaultTextMapPropagator;
    propagator.Inject(
        new PropagationContext(Activity.Current?.Context ?? default, Baggage.Current),
        message.ApplicationProperties,
        (props, key, value) => props[key] = value);

    await _sender.SendMessageAsync(message);
}

// Receiver: extract trace context from message and set as parent
public async Task ProcessMessageAsync(ServiceBusReceivedMessage message)
{
    // Extract context from message properties
    var propagator = Propagators.DefaultTextMapPropagator;
    var parentContext = propagator.Extract(
        default,
        message.ApplicationProperties,
        (props, key) =>
        {
            if (props.TryGetValue(key, out var value))
                return [value?.ToString() ?? ""];
            return [];
        });

    // Restore propagated baggage, then start a child span linked to the sender's trace
    Baggage.Current = parentContext.Baggage;
    using var activity = OrdersTracing.Source.StartActivity(
        "ServiceBus.ProcessOrder",
        ActivityKind.Consumer,
        parentContext.ActivityContext);

    activity?.SetTag("messaging.system",      "servicebus");
    activity?.SetTag("messaging.destination", "orders");
    activity?.SetTag("messaging.message_id",  message.MessageId);

    await ProcessOrderAsync(message.Body.ToString());
}
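
The traceparent layout shown earlier in this section is simple enough to parse by hand, which helps when debugging a hop that drops or corrupts the header. The sketch below is in Python as runnable pseudocode, not a replacement for the OTel propagators. It handles only the version-00 layout; `parse_traceparent` and `child_traceparent` are hypothetical helper names.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version-00 layout: 2 hex version, 32 hex trace ID, 16 hex parent ID, 2 hex flags
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None when it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid
    if version == "ff" or int(trace_id, 16) == 0 or int(parent_id, 16) == 0:
        return None
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent) -> str:
    """Build the header for an outgoing call: same trace, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(incoming)
outgoing = child_traceparent(parsed)
print(parsed)
print(outgoing)
```

The trace ID and the sampled flag are carried forward unchanged while the span ID is replaced at every hop, which is how each downstream span records the caller as its parent.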

Baggage: Cross-Service Context

csharp
// Set baggage at entry point (API Gateway / BFF)
Baggage.Current = Baggage.SetBaggage("tenant.id",    request.TenantId);
Baggage.Current = Baggage.SetBaggage("user.id",      user.GetUserId());
Baggage.Current = Baggage.SetBaggage("feature.flags","beta-checkout=true");

// Read baggage anywhere in the call chain (propagated automatically)
var tenantId   = Baggage.GetBaggage("tenant.id");
var featureFlags = Baggage.GetBaggage("feature.flags");

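// Baggage travels with the trace context, but it is not copied onto spans
// automatically: call activity?.SetTag(...) for values you want to query on.
// It is sent in request headers, so never put secrets or personal data in it.
// Useful for: multi-tenant filtering, feature flag tracking,
//             A/B test attribution, compliance tagging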
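
On the wire, baggage is carried in a `baggage` HTTP header as a comma-separated list of percent-encoded key/value pairs. The sketch below shows that encoding round-trip in Python as runnable pseudocode. It is a simplification that ignores most validation rules in the W3C Baggage specification; `encode_baggage`, `decode_baggage`, and the `MAX_HEADER_BYTES` budget are assumptions made for this example.

```python
from urllib.parse import quote, unquote

MAX_HEADER_BYTES = 8192  # assumed size budget for this sketch; check the spec's limits


def encode_baggage(entries: dict) -> str:
    """Serialise key/value pairs as a comma-separated, percent-encoded list."""
    header = ",".join(
        f"{key}={quote(str(value), safe='')}" for key, value in entries.items()
    )
    if len(header.encode("utf-8")) > MAX_HEADER_BYTES:
        raise ValueError("baggage too large to propagate safely")
    return header


def decode_baggage(header: str) -> dict:
    """Parse a baggage header back into a dict, ignoring optional properties."""
    entries = {}
    for member in header.split(","):
        pair = member.split(";", 1)[0].strip()  # drop ";property" metadata
        if "=" not in pair:
            continue  # skip malformed or empty members
        key, value = pair.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


# Hypothetical values for the keys used in the C# example above
header = encode_baggage({
    "tenant.id": "contoso",
    "user.id": "42",
    "feature.flags": "beta-checkout=true",
})
print(header)
print(decode_baggage(header))
```

Because every entry is attached to every outgoing request, keeping baggage small matters for both latency and header-size limits on intermediaries.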

05c Trace Sampling Strategies

At scale, recording every single trace is prohibitively expensive: a service handling 10,000 requests per second generates terabytes of trace data daily. Sampling strategies let you keep costs manageable while preserving visibility into errors and slow requests. Head-based sampling decides at trace start whether to record (simple but may miss rare errors), while tail-based sampling buffers complete traces and keeps only interesting ones (errors, slow requests), which is more powerful but requires additional infrastructure. The best production strategy combines ratio-based sampling for normal traffic with always-on sampling for critical paths like payments and checkout.

| Strategy | Decision Point | Pros | Cons |
|---|---|---|---|
| AlwaysOn | Head (trace start) | See everything | Huge data volume and cost at scale |
| AlwaysOff | Head | Zero cost | No visibility at all, dev only |
| TraceIdRatio (N%) | Head | Predictable volume, simple config | May miss infrequent errors |
| ParentBased | Head | Respects upstream sampling decision | A downstream service cannot override the upstream decision |
| Tail-Based (Jaeger) | Tail (after trace) | Always keeps error traces | Requires full trace buffering, more infra |
| Adaptive (App Insights) | Head (dynamic) | Auto-adjusts to traffic volume | Proprietary, App Insights only |

csharp
// ── Head-based sampling via OTel SDK ─────────────────────────────────

// 1. Sample 10% of new traces; child spans follow their parent's decision.
//    A head sampler runs before the request executes, so it cannot know
//    whether the trace will end in an error. To keep every error trace,
//    use tail-based sampling (for example in an OTel Collector) instead.
builder.Services.AddOpenTelemetry()
    .WithTracing(t => t
        .SetSampler(new ParentBasedSampler(
            new TraceIdRatioBasedSampler(0.10))));

// Custom sampler: always sample critical paths, 5% of everything else.
// Latency is unknown at trace start, so "always keep slow requests" needs a
// tail sampler; a head sampler can only use what is known up front (the path).
// Requires: using System.Linq;
public class CriticalPathSampler : Sampler
{
    // Ratio sampler decides from the trace ID, so every service agrees
    private readonly Sampler _fallback = new TraceIdRatioBasedSampler(0.05);

    public override SamplingResult ShouldSample(in SamplingParameters parameters)
    {
        // Always sample if parent says so (preserve parent decision)
        if (parameters.ParentContext.TraceFlags.HasFlag(ActivityTraceFlags.Recorded))
            return new SamplingResult(SamplingDecision.RecordAndSample);

        // Sample based on URL priority; the tag key depends on the
        // semantic-convention version the instrumentation emits
        var httpPath = parameters.Tags?
            .FirstOrDefault(t => t.Key is "url.path" or "http.target").Value?.ToString();

        // Always trace payment and checkout flows
        if (httpPath?.Contains("/checkout") == true ||
            httpPath?.Contains("/payments") == true)
            return new SamplingResult(SamplingDecision.RecordAndSample);

        // 5% sample for everything else
        return _fallback.ShouldSample(parameters);
    }
}
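
Ratio-based head sampling works across services because the decision is derived from the trace ID rather than from a random draw, so every hop that sees the same trace reaches the same verdict. The Python sketch below illustrates that principle as runnable pseudocode. It is a simplified stand-in, not the exact algorithm used by any OTel SDK, and `should_sample` is a hypothetical name.

```python
import random


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic head-sampling decision derived from the trace ID alone."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the trace ID as a uniformly distributed number
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < ratio * 2**64


# The same trace ID always produces the same decision, on any service
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
decisions = {should_sample(trace_id, 0.10) for _ in range(3)}

# Over many trace IDs the kept fraction approaches the configured ratio
rng = random.Random(42)  # fixed seed so the sketch is repeatable
ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
kept_fraction = sum(should_sample(t, 0.10) for t in ids) / len(ids)
print(decisions, round(kept_fraction, 3))
```

A per-request random draw, by contrast, can keep a span in one service and drop its children in another, leaving broken traces.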

05d Jaeger & Trace Backends

Your trace data needs a backend to store, index, and visualize it, and you have several options depending on your environment and budget. Azure Monitor (Application Insights) is the native choice for Azure workloads with zero infrastructure to manage, while Jaeger and Grafana Tempo are excellent open-source alternatives for local development, staging environments, or multi-cloud setups. The beauty of OpenTelemetry is that you can switch backends by changing exporter configuration alone; your instrumentation code stays the same. Use Jaeger locally via Docker for instant trace visualization during development, then export to Azure Monitor in production.

| Backend | Type | Best For | Azure Integration |
|---|---|---|---|
| Azure Monitor (App Insights) | Managed SaaS | Full Azure-native stack, E2E view | Native: UseAzureMonitor() |
| Jaeger | Open source | Dev/staging, self-hosted tracing | Deploy on AKS, OTLP ingest |
| Grafana Tempo | Open source | Cost-effective, Grafana-native | Deploy on AKS, query in Grafana |
| Zipkin | Open source | Legacy, simple setup | OTLP export via OTel collector |
| Datadog APM | Commercial SaaS | Cross-cloud, ML anomaly detection | OTLP export to Datadog agent |
| Honeycomb | Commercial SaaS | High-cardinality event analytics | OTLP export |

Local Dev with Jaeger via Docker

yaml
# docker-compose.override.yml (dev environment)
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "4317:4317"     # OTLP gRPC receiver
      - "4318:4318"     # OTLP HTTP receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    networks: [platform]

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4319:4317"     # Receive from apps
    networks: [platform]

# otel-collector-config.yaml
# receivers:
#   otlp:
#     protocols: { grpc: { endpoint: 0.0.0.0:4317 } }
# exporters:
#   otlp/jaeger:                 # Jaeger ingests OTLP natively; the dedicated
#     endpoint: jaeger:4317      # "jaeger" exporter is gone from recent collector releases
#     tls: { insecure: true }
#   azuremonitor:
#     connection_string: "${AI_CONNECTION_STRING}"
# service:
#   pipelines:
#     traces:
#       receivers: [otlp]
#       exporters: [otlp/jaeger, azuremonitor]

json
// appsettings.Development.json (point to local Jaeger)
{
  "Otel": {
    "Endpoint": "http://localhost:4317"
  },
  "ApplicationInsights": {
    "ConnectionString": ""
  }
}

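// appsettings.Production.json (Azure Monitor)
// The OTLP exporter needs an OTLP receiver such as an OTel Collector; the
// Application Insights ingestion endpoint is not one, so the endpoint below is
// a placeholder collector address. Azure Monitor export is driven by the
// connection string. Key Vault references are resolved by App Service app
// settings, not by this file, so supply the value there in production.
{
  "Otel": {
    "Endpoint": "http://otel-collector:4317"
  },
  "ApplicationInsights": {
    "ConnectionString": "@Microsoft.KeyVault(VaultName=myVault;SecretName=ai-conn-string)"
  }
}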

06 SLIs, SLOs & Error Budgets

SLOs (Service Level Objectives) are the bridge between engineering and business. They define what "good enough" looks like for your service in terms of measurable user experience.

| Term | Definition | Example |
|---|---|---|
| SLI (Indicator) | The metric you measure | Share of POST /api/orders requests served in under 500 ms |
| SLO (Objective) | The target level for the SLI | 99.5% of requests complete in < 500 ms over 30 days |
| SLA (Agreement) | Contract with users/customers, a legal commitment | 99.9% uptime per month |
| Error Budget | The unreliability the SLO permits (100% minus the target), spendable on incidents and deployments | 0.5% of requests allowed to fail in 30 days |
| Burn Rate | How fast you're consuming error budget relative to the sustainable pace | 2x burn rate = budget gone in 15 days, not 30 |
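
The error budget and burn rate rows reduce to a few lines of arithmetic. The sketch below works through them in Python as runnable pseudocode; the request counts are made-up sample numbers and the function names are hypothetical.

```python
def allowed_failures(total_requests: int, slo_target_pct: float) -> float:
    """Requests that may fail in the window without breaching the SLO."""
    return total_requests * (100.0 - slo_target_pct) / 100.0


def burn_rate(failed: int, total: int, slo_target_pct: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0
    budget_ratio = (100.0 - slo_target_pct) / 100.0
    return (failed / total) / budget_ratio


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days a full budget lasts if the current burn rate is sustained."""
    return float("inf") if rate <= 0 else window_days / rate


# 10 million requests in 30 days at a 99.5% SLO
budget = allowed_failures(10_000_000, 99.5)

# 100 failures in the last 10,000 requests is a 1% error ratio
rate = burn_rate(failed=100, total=10_000, slo_target_pct=99.5)

print(round(budget), round(rate, 2), round(days_until_exhausted(rate), 1))
```

A burn rate of 1 means the budget lasts exactly the SLO window. Here a 1% error ratio against a 0.5% allowance gives a burn rate of 2, which matches the "gone in 15 days" example in the table.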

SLO Queries in KQL

kql
// ── Availability SLO: % of successful requests ──────────────────────
let sloWindow = 30d;
let sloTarget = 99.5; // 99.5% availability
requests
| where timestamp > ago(sloWindow)
| where name !startswith "GET /health"   // Exclude health checks from SLO
| summarize
    Total   = count(),
    Success = countif(success == true)
| extend
    AvailabilitySLI = round(100.0 * Success / Total, 3),
    ErrorBudget_pct = 100.0 - sloTarget,
    ActualErrors    = Total - Success,
    BudgetedErrors  = round(Total * (1 - sloTarget / 100.0), 0),
    BudgetRemaining = round(Total * (1 - sloTarget / 100.0), 0) - (Total - Success)
| extend SLO_Met = AvailabilitySLI >= sloTarget

// ── Latency SLO: 99% of requests must complete in under 500 ms ───────
requests
| where timestamp > ago(30d)
| where name == "POST /api/v1/orders"
| summarize
    TotalRequests     = count(),
    WithinTarget      = countif(duration < 500),
    P50_ms = percentile(duration, 50),
    P95_ms = percentile(duration, 95),
    P99_ms = percentile(duration, 99)
| extend
    SLI_latency = round(100.0 * WithinTarget / TotalRequests, 3),
    SLO_target  = 99.0,
    SLO_Met     = (100.0 * WithinTarget / TotalRequests) >= 99.0

// ── Error budget burn rate alert ──────────────────────────────────────
// Alert when burning through budget more than 2x faster than sustainable
let budget_pct   = 0.5;   // 0.5% of requests can fail
let burn_window  = 1h;
let total_window = 30d;
let totalRequests = toscalar(requests | where timestamp > ago(total_window) | count);
let budgetTotal   = totalRequests * (budget_pct / 100.0);
let recentErrors  = toscalar(
    requests | where timestamp > ago(burn_window) | where success == false | count);
let sustainableBurnPerHour = budgetTotal / (30.0 * 24.0);
// A query must end in a tabular statement; returning a row only when the
// condition holds lets a log search alert fire on "rows > 0"
print RecentErrors = recentErrors, SustainablePerHour = sustainableBurnPerHour
| where RecentErrors > SustainablePerHour * 2  // Alert: burning 2x too fast

07 Alerting Patterns & Runbooks

Good alerting is the difference between catching incidents in minutes versus hours, but bad alerting creates noise that trains your team to ignore pages. The golden rule is to alert on user-visible symptoms (error rate, latency, availability) rather than internal causes (CPU, memory), and every alert must have a clear runbook explaining what to do when it fires. Combine multi-window burn rate alerts for SLO-based detection with deadman's switches for silence detection, and tier severity levels so Sev1 pages on-call immediately while Sev3 creates a ticket for next business day.

🚨
Symptom-Based Alerting
Alert on user-visible symptoms (error rate, latency, availability) not internal causes (CPU, memory). Users don't care about CPU β€” they care about failed requests.
πŸ“Š
Multi-Window Burn Rate
Alert on error budget burn rate at two windows (1h fast-burn AND 6h slow-burn). Catches both sudden spikes and gradual degradation before the budget is exhausted.
πŸ”‡
Alert Fatigue Prevention
Every alert must be actionable with a clear runbook. Silence noisy alerts or increase thresholds. An alert that fires every day and nobody acts on is worse than no alert.
πŸ“
Runbook Links
Every alert includes a link to its runbook in the description β€” what the alert means, initial diagnosis steps, escalation path, and mitigation actions.
🌑️
Deadman's Switch
Alert when a metric stops being reported β€” e.g. if no requests arrive in 5 min, something is broken. Silence is often worse than an error.
🎯
Severity Tiering
Sev1: page on-call immediately (user impact now). Sev2: notify team (degradation). Sev3: ticket (investigate next business day). Match severity to actual impact.
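
The multi-window burn rate, deadman's switch, and severity tiering patterns above can be expressed as a few small decision functions. The sketch below is one possible reading, not a prescribed implementation: it treats the 1h and 6h windows as separate fast-burn and slow-burn checks, and the thresholds (10x and 2x), the `AlertRules` class, and the `Severity` enum are all assumptions chosen for illustration.

```csharp
using System;

// Hypothetical readings: the 1h window is burning at 12x, the 6h window at 3x.
Console.WriteLine(AlertRules.Classify(burn1h: 12.0, burn6h: 3.0));   // Sev1
Console.WriteLine(AlertRules.Classify(burn1h: 1.5, burn6h: 2.5));    // Sev2
Console.WriteLine(AlertRules.IsSilent(
    lastRequestUtc: DateTimeOffset.UtcNow.AddMinutes(-7),
    nowUtc: DateTimeOffset.UtcNow,
    maxSilence: TimeSpan.FromMinutes(5)));                           // True

public enum Severity { None, Sev3, Sev2, Sev1 }

public static class AlertRules
{
    // Illustrative thresholds only; tune them against your own SLO and traffic.
    private const double FastBurnThreshold = 10.0; // 1h window: sudden spike
    private const double SlowBurnThreshold = 2.0;  // 6h window: gradual degradation

    // Maps burn rates (1.0 = spending the budget exactly on pace) to a severity tier.
    public static Severity Classify(double burn1h, double burn6h)
    {
        if (burn1h >= FastBurnThreshold) return Severity.Sev1; // page on-call now
        if (burn6h >= SlowBurnThreshold) return Severity.Sev2; // notify the team
        if (burn6h >= 1.0) return Severity.Sev3;               // ticket for next business day
        return Severity.None;
    }

    // Deadman's switch: true when no data point has arrived within the allowed silence.
    public static bool IsSilent(DateTimeOffset lastRequestUtc, DateTimeOffset nowUtc, TimeSpan maxSilence) =>
        nowUtc - lastRequestUtc > maxSilence;
}
```

In Azure Monitor these checks would normally live in alert rules rather than application code; writing them as pure functions mainly makes the thresholds easy to review and unit test.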

Runbook Template

markdown
## Alert: High Error Rate - Orders API

**Severity:** Sev 1
**SLO Impact:** Yes, consuming error budget at >2x the sustainable rate

### What this means
More than 5% of POST /api/v1/orders requests are failing over the last 5 minutes.
Users cannot place orders.

### Immediate Diagnosis (< 5 minutes)
1. Check Application Map in App Insights: which dependency is failing?
   Link: https://portal.azure.com/#resource/<ai-resource-id>/applicationMap

2. Run this query in Log Analytics:
   ```kql
   exceptions
   | where timestamp > ago(30m)
   | where cloud_RoleName == "orders-api"
   | summarize count() by outerMessage
   | order by count_ desc
   ```

3. Check recent deployments:
   Link: https://dev.azure.com/<org>/<project>/_release

### Common Causes & Fixes
| Symptom | Cause | Fix |
|---------|-------|-----|
| SQL timeout errors | DB CPU spike | Scale up SQL, check missing indexes |
| 503 from inventory-api | Inventory service down | Check inventory pod status in AKS |
| 401 from payment-api | Token expiry / MI issue | Restart pod, check MI role assignments |
| OutOfMemory | Memory leak in new deploy | Roll back deployment |

### Escalation
- Engineering on-call: PagerDuty → orders-api rotation
- Incident commander: Slack #incidents

08Comparison & Decision Tables

With several overlapping services in the Azure monitoring stack, knowing which tool to reach for saves time during incidents and planning. Application Insights is the APM layer for request-level performance, Log Analytics is the query engine for cross-service log correlation, Azure Monitor Metrics provides near-real-time numeric dashboards, and distributed tracing shows the full request path across microservices. Use the decision tables below as a quick reference when you need to answer a specific observability question.

When to Use Each Service

| Question | Answer / Service |
|----------|------------------|
| I want to see all HTTP requests and response times for my API | Application Insights: Requests table |
| I want to see the SQL queries my app is running and how slow they are | Application Insights: Dependencies table |
| I want to search across all service logs including platform events | Log Analytics: KQL across all tables |
| I want a chart of CPU and memory over time | Azure Monitor Metrics: metric explorer |
| I want to be paged when error rate exceeds 5% | Azure Monitor Alert Rule (log search or metric) |
| I want to see the full path of one slow request across 5 microservices | Distributed Tracing: App Insights E2E or Jaeger |
| I want to track a custom business event like 'OrderPlaced' | Application Insights: TrackEvent / customEvents |
| I want to query Key Vault access audit logs | Log Analytics: AzureDiagnostics filtered on ResourceType == "VAULTS" |
| I want a dashboard showing SLOs for 5 services | Azure Monitor Workbook or Grafana |
| I want to detect anomalous exception spikes automatically | Application Insights Smart Detection |
| I want Prometheus metrics for my K8s workloads | Azure Monitor Metrics + Container Insights + Prometheus scraping |
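
For the custom business event row, a minimal sketch of what the tracking call might look like with the classic Application Insights SDK (2.x) follows. It assumes a `TelemetryClient` is supplied by dependency injection; the `OrderTelemetry` wrapper and the property names `OrderId`, `Region`, and `OrderTotal` are hypothetical.

```csharp
using System.Collections.Generic;
using Microsoft.ApplicationInsights;

// Hypothetical wrapper; TelemetryClient is registered by the classic App Insights SDK.
public sealed class OrderTelemetry
{
    private readonly TelemetryClient _telemetry;

    public OrderTelemetry(TelemetryClient telemetry) => _telemetry = telemetry;

    public void OrderPlaced(string orderId, string region, double orderTotal)
    {
        // String properties surface in customDimensions, numeric values in customMeasurements.
        _telemetry.TrackEvent(
            "OrderPlaced",
            new Dictionary<string, string> { ["OrderId"] = orderId, ["Region"] = region },
            new Dictionary<string, double> { ["OrderTotal"] = orderTotal });
    }
}
```

The event then appears in the customEvents table, where a property such as Region can be read with `tostring(customDimensions["Region"])`.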

Logging Levels β€” What to Log at Each Level

| Level | ILogger Method | Use For | Production Volume |
|-------|----------------|---------|-------------------|
| Critical | LogCritical | App cannot continue: unrecoverable failure | Rare: immediate page |
| Error | LogError | Operation failed: exception, 5xx, data corruption | Low: alert on any |
| Warning | LogWarning | Unexpected but handled: retry, DLQ, validation fail | Medium: trend on |
| Information | LogInformation | Normal business flow: request received, order created | High: sample in prod |
| Debug | LogDebug | Diagnostic detail: cache hit/miss, intermediate values | None in prod |
| Trace | LogTrace | Very verbose: method entry/exit, every loop iteration | Never in prod |
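
The levels in the table map directly onto the `ILogger` extension methods. The following sketch assumes a console app with the Microsoft.Extensions.Logging and Microsoft.Extensions.Logging.Console packages installed; the logger category `Orders`, the order id, and the messages are made up for illustration. The minimum level is set to Information to mimic a production filter, so the Trace and Debug calls produce no output.

```csharp
using System;
using Microsoft.Extensions.Logging;

// Production-style filter: Information and above are emitted, Debug and Trace are dropped.
using var factory = LoggerFactory.Create(b => b
    .AddConsole()
    .SetMinimumLevel(LogLevel.Information));
ILogger logger = factory.CreateLogger("Orders");

string orderId = "A-1001";   // hypothetical order id
int attempt = 2;

logger.LogTrace("Entering PlaceOrder for {OrderId}", orderId);   // filtered out
logger.LogDebug("Cache miss for order {OrderId}", orderId);      // filtered out
logger.LogInformation("Order {OrderId} received", orderId);
logger.LogWarning("Retrying payment for {OrderId}, attempt {Attempt}", orderId, attempt);

try
{
    throw new TimeoutException("payment-api did not respond");
}
catch (TimeoutException ex)
{
    // Pass the exception as the first argument so the stack trace is captured.
    logger.LogError(ex, "Order {OrderId} failed", orderId);
}

logger.LogCritical("Database connection pool exhausted; shutting down");
```

Using message templates with named placeholders, rather than string interpolation, keeps values such as OrderId as structured properties that can be filtered on later.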

09Quick Reference Cheat Sheet

This cheat sheet distills the most commonly used KQL patterns and OpenTelemetry setup snippets into copy-paste-ready blocks. Keep them handy during incident response, when you need to query error rates or latency percentiles quickly or trace a specific operation across services. The NuGet package table at the bottom lists which packages to install for each instrumentation scenario, from basic ASP.NET Core auto-instrumentation to full OTel with Prometheus metrics export.

Essential KQL Patterns
kql
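// Always start with a time filter
// (TimeGenerated in workspace tables, timestamp in classic App Insights tables)
| where TimeGenerated > ago(1h)

// Error rate (requests table, so the time column is timestamp)
| summarize ErrorRate = countif(success == false) * 100.0 / count() by bin(timestamp, 5m)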

// Latency percentiles
| summarize P50=percentile(duration,50), P95=percentile(duration,95), P99=percentile(duration,99) by name

// Top N
| top 20 by Count desc

// Count distinct
| summarize UniqueUsers = dcount(user_Id)

// Dynamic property
| extend Region = tostring(customDimensions["Region"])

// Time render
| render timechart

// Cross AI resource query
app("MyOtherAppInsights").requests | where timestamp > ago(1h)

// Parse structured log message
| parse Message with "OrderId=" OrderId " status=" Status
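
OTel .NET 8 Minimal Setup
csharp
// Minimal OTel setup for a microservice
using System.Diagnostics;
using System.Diagnostics.Metrics;
using Azure.Monitor.OpenTelemetry.AspNetCore;
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

builder.Services.AddOpenTelemetry()
    // serviceVersion must be named: the second positional parameter is serviceNamespace
    .ConfigureResource(r => r.AddService("my-service", serviceVersion: "1.0.0"))
    .WithTracing(t => t
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddSource("MyService")   // must match the ActivitySource name below
        .AddOtlpExporter())
    .WithMetrics(m => m
        .AddAspNetCoreInstrumentation()
        .AddMeter("MyService")    // must match the Meter name below
        .AddOtlpExporter())
    // Also send to App Insights (reads APPLICATIONINSIGHTS_CONNECTION_STRING)
    .UseAzureMonitor();

// Declare once as static fields in the class that emits telemetry.
// The instrument name "orders.processed" is a placeholder.
static readonly ActivitySource MySource = new("MyService");
static readonly Meter MyMeter = new("MyService");
static readonly Counter<long> MyCounter = MyMeter.CreateCounter<long>("orders.processed");

// Custom span (activity is null when no listener is sampling this source)
using var activity = MySource.StartActivity("DoSomething");
activity?.SetTag("key", "value");
activity?.AddEvent(new ActivityEvent("StepCompleted"));

// Custom metric
MyCounter.Add(1, new KeyValuePair<string,object?>("region","EU"));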

| NuGet Package | Purpose |
|---------------|---------|
| Microsoft.ApplicationInsights.AspNetCore | Classic App Insights SDK for ASP.NET Core |
| Azure.Monitor.OpenTelemetry.AspNetCore | Modern OTel-based Azure Monitor exporter (recommended) |
| OpenTelemetry.Extensions.Hosting | OTel host builder extensions |
| OpenTelemetry.Instrumentation.AspNetCore | Auto-instrument HTTP requests |
| OpenTelemetry.Instrumentation.Http | Auto-instrument HttpClient calls |
| OpenTelemetry.Instrumentation.EntityFrameworkCore | Auto-instrument EF Core SQL queries |
| OpenTelemetry.Instrumentation.Runtime | GC, thread pool, heap metrics |
| OpenTelemetry.Exporter.OpenTelemetryProtocol | OTLP exporter (Jaeger, Tempo, Collector) |
| OpenTelemetry.Exporter.Prometheus.AspNetCore | /metrics endpoint for Prometheus scraping |
| Serilog.AspNetCore + Serilog.Sinks.ApplicationInsights | Structured logging to App Insights via Serilog |
| Microsoft.ApplicationInsights.WorkerService | App Insights for background services and workers |