Azure Monitor — Distributed Tracing and Correlation IDs

Why Distributed Tracing Matters

A single user request in a microservices architecture traverses multiple services. Without distributed tracing you cannot:

Debug failures — When Service D fails, trace back through A → B → C to find root cause
Analyze latency — Break down a 3-second request into time spent at each hop
Map dependencies — Discover actual service communication patterns

Scenario	Without Tracing	With Tracing
Production incident	Hours of log correlation	Minutes to pinpoint
Performance regression	Guess-and-check	Immediate bottleneck ID
Dependency failure	Unclear blast radius	Exact impact mapping

Architecture: Trace Propagation

┌───────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED TRACE FLOW                         │
│                                                                   │
│  traceparent: 00-<trace-id>-<span-id>-01                          │
│                                                                   │
│  ┌──────────┐   HTTP    ┌──────────┐  Service Bus  ┌──────────┐   │
│  │ Frontend │──────────▶│ Order API│──────────────▶│ Payment  │   │
│  │  (SPA)   │           │  (.NET)  │               │ (Node.js)│   │
│  └──────────┘           └────┬─────┘               └──────────┘   │
│                              │ HTTP                               │
│                         ┌────▼─────┐                              │
│                         │Inventory │                              │
│                         │ Service  │                              │
│                         └────┬─────┘                              │
│                              │ Event Grid                         │
│                         ┌────▼─────┐                              │
│                         │  Audit   │                              │
│                         │ Service  │                              │
│                         └──────────┘                              │
│                                                                   │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │           Azure Monitor / Application Insights              │  │
│  │  requests | dependencies | traces | exceptions              │  │
│  │  All correlated by operation_Id (trace-id)                  │  │
│  └─────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────┘

Understanding W3C Trace Context

The traceparent Header

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ──  ────────────────────────────────  ────────────────  ──
          version         trace-id                    span-id      flags
                       (operation_Id)              (parent-id)   (sampled)

The tracestate header carries vendor-specific data (e.g., azure=<correlation-data>).
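
To make the field layout concrete, here is a minimal TypeScript sketch that splits a traceparent value in the version-00 layout into its four fields. The names `TraceParent` and `parseTraceparent` are illustrative and not part of any SDK; the checks (lowercase hex, fixed lengths, no all-zero IDs) follow the W3C Trace Context format shown above.

```typescript
// Fields of a W3C traceparent header in the version-00 layout.
interface TraceParent {
  version: string;  // "00" for the current spec
  traceId: string;  // 32 hex chars, shown as operation_Id in Application Insights
  parentId: string; // 16 hex chars, the caller's span-id
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for malformed input so the caller can start a fresh trace instead.
function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return null;

  const [, version, traceId, parentId, flags] = match;
  // "ff" is a forbidden version, and all-zero IDs are invalid per the spec.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;

  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

Returning null rather than throwing keeps a bad inbound header from failing the request; the service simply becomes the root of a new trace.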

Application Insights Correlation Model

Concept	W3C Field	App Insights Field	Purpose
Operation ID	trace-id	`operation_Id`	Groups all telemetry for one transaction
Parent ID	parent span-id	`operation_ParentId`	Links child to parent span
Request ID	span-id	`id`	Unique ID for this operation
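
The parent/child links in this table are enough to rebuild the call tree for one transaction. The TypeScript sketch below assumes the rows have already been fetched for a single operation_Id and are shaped like the hypothetical `TelemetryRow` type. It illustrates the linkage only; it is not how the portal builds its transaction view.

```typescript
// One telemetry item, reduced to the correlation fields from the table above.
interface TelemetryRow {
  id: string;                 // this operation's own span-id
  operation_Id: string;       // trace-id shared by the whole transaction
  operation_ParentId: string; // span-id of the caller
  name: string;
}

interface SpanNode {
  row: TelemetryRow;
  children: SpanNode[];
}

// Links each row to the row whose id equals its operation_ParentId.
// Rows whose parent is not in the set are returned as roots.
function buildSpanTree(rows: TelemetryRow[]): SpanNode[] {
  const nodesById = new Map<string, SpanNode>();
  for (const row of rows) nodesById.set(row.id, { row, children: [] });

  const roots: SpanNode[] = [];
  for (const row of rows) {
    const node = nodesById.get(row.id);
    if (node === undefined) continue;
    const parent = nodesById.get(row.operation_ParentId);
    if (parent !== undefined && parent !== node) parent.children.push(node);
    else roots.push(node);
  }
  return roots;
}
```

Treating unmatched parents as roots matters in practice: sampling or a missing SDK in one service can leave a gap, and the remaining spans should still be displayed.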

Auto-Correlation Flow

Inbound — SDK reads traceparent, extracts trace-id as operation_Id
Processing — Creates new span-id, sets parent-id to incoming span-id
Outbound — Injects updated traceparent into outgoing HTTP/messaging calls
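
The three steps can be sketched in a few lines of TypeScript. This is not the SDK's implementation: `startSpan` and `toTraceparent` are hypothetical helpers, and span IDs come from Math.random purely to keep the example dependency-free (a real tracer would use a stronger random source).

```typescript
interface SpanContext {
  traceId: string;             // shared by every hop; becomes operation_Id
  spanId: string;              // new for this service's unit of work
  parentSpanId: string | null; // caller's span-id; null at the start of a trace
  sampled: boolean;
}

// Random lowercase hex string; Math.random keeps the sketch dependency-free.
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) out += Math.floor(Math.random() * 16).toString(16);
  return out;
}

// Inbound + processing: reuse the caller's trace-id, record its span-id as the parent,
// and mint a fresh span-id for the work done in this service.
function startSpan(inboundTraceparent?: string): SpanContext {
  const parts = (inboundTraceparent ?? "").split("-");
  const valid =
    parts.length === 4 &&
    /^[0-9a-f]{32}$/.test(parts[1]) &&
    /^[0-9a-f]{16}$/.test(parts[2]) &&
    /^[0-9a-f]{2}$/.test(parts[3]);
  if (!valid) {
    // No usable header: this service is the root of a new trace (sampled by assumption).
    return { traceId: randomHex(32), spanId: randomHex(16), parentSpanId: null, sampled: true };
  }
  return {
    traceId: parts[1],
    spanId: randomHex(16),
    parentSpanId: parts[2],
    sampled: (parseInt(parts[3], 16) & 1) === 1,
  };
}

// Outbound: the header sent downstream carries this service's span-id, not the caller's.
function toTraceparent(ctx: SpanContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}
```

The key point is that the trace-id never changes along the chain, while the span-id in the header is replaced at every hop so the next service can record the correct parent.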

Step-by-Step Implementation

Application Insights SDK (.NET)

dotnet add package Microsoft.ApplicationInsights.AspNetCore

// Program.cs
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});
var app = builder.Build();

Application Insights SDK (Node.js)

npm install applicationinsights

// instrumentation.ts — import FIRST in entry point
import * as appInsights from "applicationinsights";

appInsights.setup(process.env.APPLICATIONINSIGHTS_CONNECTION_STRING)
  .setAutoCollectRequests(true)
  .setAutoCollectDependencies(true)
  .setAutoCollectExceptions(true)
  .setDistributedTracingMode(appInsights.DistributedTracingModes.AI_AND_W3C)
  .start();

export const telemetryClient = appInsights.defaultClient;

OpenTelemetry Integration

dotnet add package Azure.Monitor.OpenTelemetry.AspNetCore

// .NET — OpenTelemetry with Azure Monitor exporter
using Azure.Monitor.OpenTelemetry.AspNetCore;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddOpenTelemetry().UseAzureMonitor(options =>
{
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});

// Node.js — OpenTelemetry
import { useAzureMonitor } from "@azure/monitor-opentelemetry";

useAzureMonitor({
  azureMonitorExporterOptions: {
    connectionString: process.env.APPLICATIONINSIGHTS_CONNECTION_STRING,
  },
});

Custom Telemetry with TelemetryClient

// IOrderRepository is a placeholder for whatever data-access type the application injects
public class OrderService(TelemetryClient telemetry, IOrderRepository repository)
{
    public async Task<Order> ProcessOrder(string orderId)
    {
        // Telemetry tracked inside this scope shares the operation's operation_Id
        using var operation = telemetry.StartOperation<RequestTelemetry>("ProcessOrder");
        operation.Telemetry.Properties["OrderId"] = orderId;
        try
        {
            var order = await repository.GetAsync(orderId);
            telemetry.TrackEvent("OrderProcessed", new Dictionary<string, string>
            {
                ["OrderId"] = orderId, ["Amount"] = order.Total.ToString()
            });
            return order;
        }
        catch (Exception ex)
        {
            // Record the exception and mark the operation failed before rethrowing
            telemetry.TrackException(ex);
            operation.Telemetry.Success = false;
            throw;
        }
    }
}

Propagating Correlation Across Messaging

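HTTP correlation is automatic. Messaging needs more care: the Service Bus client adds a Diagnostic-Id property to each message, while the Event Grid example below embeds the trace ID in the event data by hand: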

// Service Bus — SDK auto-propagates Diagnostic-Id in ApplicationProperties
var sender = serviceBusClient.CreateSender("orders");
await sender.SendMessageAsync(new ServiceBusMessage(JsonSerializer.Serialize(order))
{
    ApplicationProperties = { ["OrderId"] = order.Id }
});
// Diagnostic-Id is automatically added by Azure.Messaging.ServiceBus

// Event Grid — embed trace context in event data
var eventData = new { Order = order, TraceId = Activity.Current?.TraceId.ToString() };
await eventGridClient.SendEventAsync(
    new EventGridEvent("orders/created", "Order.Created", "1.0", eventData));
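
Carrying only the trace ID lets the consumer join the same transaction, but not attach to the exact span that published the event. One option, sketched here in TypeScript, is to embed a full traceparent string in the event data instead. `TracedEventData`, `wrapWithTrace`, and `readTrace` are hypothetical names, and the `traceparent` field inside the payload is a convention chosen for this example, not something Event Grid interprets.

```typescript
interface TracedEventData<T> {
  payload: T;
  traceparent: string; // 00-<trace-id>-<span-id>-<flags>, set by the publisher
}

// Publisher side: attach the current span's identifiers to the event body.
// The flags byte is fixed at 01 here, assuming the publishing span is sampled.
function wrapWithTrace<T>(payload: T, traceId: string, spanId: string): TracedEventData<T> {
  return { payload, traceparent: `00-${traceId}-${spanId}-01` };
}

// Consumer side: recover the IDs to use as operation ID and parent ID
// when starting the consumer's own operation. Returns null if absent or malformed.
function readTrace(data: { traceparent?: string }): { operationId: string; parentId: string } | null {
  const parts = (data.traceparent ?? "").split("-");
  if (parts.length !== 4 || parts[1].length !== 32 || parts[2].length !== 16) return null;
  return { operationId: parts[1], parentId: parts[2] };
}
```

With both IDs available, the consumer's request shows up as a child of the publishing span rather than as a sibling at the top of the transaction.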

Advanced Correlation Patterns

Service Bus Consumer Correlation

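public async Task ProcessMessageAsync(ProcessMessageEventArgs args)
{
    // Azure.Messaging.ServiceBus v7+ auto-links via Diagnostic-Id
    // For manual control, read it and split the W3C form: 00-<trace-id>-<span-id>-<flags>
    var diagnosticId = args.Message.ApplicationProperties
        .TryGetValue("Diagnostic-Id", out var id) ? id?.ToString() : null;
    var parts = diagnosticId?.Split('-');
    var isW3C = parts is { Length: 4 };

    // Fall back to a new operation when the header is missing or not in W3C form
    using var operation = _telemetry.StartOperation<RequestTelemetry>(
        "ProcessOrderMessage",
        operationId: isW3C ? parts![1] : null,
        parentOperationId: isW3C ? parts![2] : null);

    // TryGetValue avoids a KeyNotFoundException when the sender did not set OrderId
    if (args.Message.ApplicationProperties.TryGetValue("OrderId", out var orderId))
        operation.Telemetry.Properties["OrderId"] = orderId?.ToString();

    await HandleOrder(args.Message);
}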

Event Grid Consumer Correlation

[Function("HandleOrderEvent")]
public async Task Run([EventGridTrigger] EventGridEvent evt)
{
    var data = evt.Data.ToObjectFromJson<OrderEventData>();
    using var operation = _telemetry.StartOperation<RequestTelemetry>(
        "HandleOrderEvent",
        operationId: data.TraceId);

    operation.Telemetry.Properties["EventType"] = evt.EventType;
    await ProcessEvent(data);
}

Background Job Correlation

protected override async Task ExecuteAsync(CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        // Scope the operation to the sync itself so its duration excludes the 5-minute wait
        using (var operation = _telemetry.StartOperation<RequestTelemetry>("InventorySync"))
        {
            operation.Telemetry.Properties["JobType"] = "Scheduled";
            try
            {
                var count = await SyncInventory(ct);
                operation.Telemetry.Metrics["ItemsSynced"] = count;
            }
            catch (Exception ex) when (!ct.IsCancellationRequested)
            {
                // Shutdown cancellation is not a failure, so it is not tracked here
                _telemetry.TrackException(ex);
                operation.Telemetry.Success = false;
            }
        }
        await Task.Delay(TimeSpan.FromMinutes(5), ct);
    }
}

Custom Multi-Step Operation Tracking

public async Task<OrderResult> PlaceOrder(OrderRequest request)
{
    using var operation = _telemetry.StartOperation<RequestTelemetry>("PlaceOrder");
    operation.Telemetry.Properties["CustomerId"] = request.CustomerId;

    using (_telemetry.StartOperation<DependencyTelemetry>("ValidateOrder"))
        await _validator.Validate(request);

    using (var dep = _telemetry.StartOperation<DependencyTelemetry>("ReserveInventory"))
    {
        dep.Telemetry.Target = "inventory-service";
        await _inventory.Reserve(request.Items);
    }

    using (var dep = _telemetry.StartOperation<DependencyTelemetry>("ProcessPayment"))
    {
        dep.Telemetry.Target = "payment-gateway";
        await _payment.Charge(request.PaymentMethod, request.Total);
    }
    return new OrderResult { Success = true };
}

KQL Queries for Distributed Tracing

End-to-End Transaction Search

// All telemetry for a specific transaction
let opId = "4bf92f3577b34da6a3ce929d0e0e4736";
union requests, dependencies, traces, exceptions
| where operation_Id == opId
| project timestamp, itemType, name, duration, success, customDimensions
| order by timestamp asc

// Find transactions by business context
union requests, dependencies, traces
| where customDimensions.OrderId == "ORD-2024-001"
| order by timestamp asc

Dependency Chain Analysis

// Full dependency chain for slow requests
requests
| where name == "POST /api/orders" and duration > 3000
| take 10
| join kind=inner (
    dependencies | project operation_Id, dep = name, target, dep_duration = duration
) on operation_Id
| summarize AvgTime = avg(dep_duration), Calls = count() by target, dep
| order by AvgTime desc

Latency Breakdown by Service

dependencies
| where timestamp > ago(1h)
| summarize P50 = percentile(duration, 50), P95 = percentile(duration, 95),
    P99 = percentile(duration, 99) by target, name
| order by P95 desc

Failed Request Correlation

requests
| where success == false and timestamp > ago(1h)
| join kind=inner (
    dependencies | where success == false
    | project operation_Id, failed_dep = name, dep_target = target, resultCode
) on operation_Id
| summarize Failures = count() by failed_dep, dep_target, resultCode
| order by Failures desc

Application Map Data

dependencies
| where timestamp > ago(1h)
| summarize Calls = count(), Failures = countif(success == false),
    AvgMs = avg(duration) by source = cloud_RoleName, target, type
| extend FailRate = round(100.0 * Failures / Calls, 1)
| where FailRate > 5 or AvgMs > 2000

Custom Dimensions and Metrics

Business Context via TelemetryInitializer

public class BusinessContextInitializer : ITelemetryInitializer
{
    private readonly IHttpContextAccessor _http;
    public BusinessContextInitializer(IHttpContextAccessor http) => _http = http;

    public void Initialize(ITelemetry telemetry)
    {
        var ctx = _http.HttpContext;
        if (ctx == null) return;
        var props = (telemetry as ISupportProperties)?.Properties;
        if (props == null) return;

        if (ctx.Request.RouteValues.TryGetValue("orderId", out var oid))
            props["OrderId"] = oid?.ToString();
        var cid = ctx.User?.FindFirst("customer_id")?.Value;
        if (cid != null) props["CustomerId"] = cid;
    }
}

// Register: builder.Services.AddSingleton<ITelemetryInitializer, BusinessContextInitializer>();

Custom Metrics for SLA Tracking

public void TrackOrderLatency(TimeSpan duration)
{
    _telemetry.GetMetric("OrderProcessingDuration").TrackValue(duration.TotalMilliseconds);
    var slaThreshold = TimeSpan.FromSeconds(5);
    _telemetry.GetMetric("OrderSlaCompliance").TrackValue(duration <= slaThreshold ? 1 : 0);
}
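
Because the compliance metric records 1 or 0 per order, its average over a time window is the fraction of orders that met the SLA. A small TypeScript sketch of that arithmetic, using a hypothetical `slaCompliance` helper and made-up durations:

```typescript
// Fraction of durations at or under the threshold; null when there is no data.
function slaCompliance(durationsMs: number[], thresholdMs: number): number | null {
  if (durationsMs.length === 0) return null;
  let met = 0;
  for (const duration of durationsMs) {
    if (duration <= thresholdMs) met += 1;
  }
  return met / durationsMs.length;
}

// Four orders against a 5-second SLA: 1.2 s, 4.0 s, and 5.0 s meet it; 6.5 s does not.
const compliance = slaCompliance([1200, 4000, 5000, 6500], 5000); // 0.75
```

Returning null for an empty window keeps "no orders" distinct from "0% compliant", which matters when the value feeds an alert.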

Sampling Strategies

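// Adaptive sampling: the SDK default targets 5 items/sec; this raises the target to 20.
// To customize it, first turn off the built-in adaptive sampling
// (EnableAdaptiveSampling = false in the AddApplicationInsightsTelemetry options).
builder.Services.Configure<TelemetryConfiguration>(config =>
{
    var chain = config.DefaultTelemetrySink.TelemetryProcessorChainBuilder;
    chain.UseAdaptiveSampling(maxTelemetryItemsPerSecond: 20, excludedTypes: "Exception;Event");
    chain.Build(); // apply the processor chain
});

// Fixed-rate sampling: predictable volume (use instead of adaptive, not in addition)
builder.Services.Configure<TelemetryConfiguration>(config =>
{
    var chain = config.DefaultTelemetrySink.TelemetryProcessorChainBuilder;
    chain.UseSampling(25); // keep 25% of telemetry
    chain.Build();
});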
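
Fixed-rate sampling is only useful for tracing if every item in a transaction gets the same keep-or-drop decision; otherwise traces end up with holes. One way to get that property is to derive the decision from the operation ID rather than from a random draw per item. The TypeScript sketch below illustrates the idea with an FNV-1a hash, which is an assumption made for the example and not the SDK's actual scoring function.

```typescript
// 32-bit FNV-1a hash of a string, returned as an unsigned integer.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Maps the operation ID to a stable score in [0, 100) and keeps the item
// when the score falls below the configured percentage.
function isSampledIn(operationId: string, samplingPercentage: number): boolean {
  const score = (fnv1a(operationId) / 0x100000000) * 100;
  return score < samplingPercentage;
}
```

Since the score depends only on the operation ID, a request, its dependencies, and its traces are all kept or all dropped, even when they are emitted by different services using the same rate.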

# Ingestion sampling via CLI
az monitor app-insights component update \
  --app "my-app-insights" \
  --resource-group "rg-monitoring" \
  --sampling-percentage 50

Alerting on Trace Data

Smart Detection

Enabled by default. Automatically detects abnormal failure rates, performance degradation, and dependency issues.

Custom Alert Rules

# Alert: more than 50 failed requests in a 5-minute window
az monitor scheduled-query create \
  --name "HighErrorRate-OrderService" \
  --resource-group "rg-monitoring" \
  --scopes "/subscriptions/{sub}/resourceGroups/rg-monitoring/providers/microsoft.insights/components/my-app-insights" \
  --condition "count 'FailedRequests' > 50" \
  --condition-query FailedRequests="requests | where cloud_RoleName == 'order-service' | where success == false | where timestamp > ago(5m)" \
  --evaluation-frequency "5m" \
  --window-size "5m" \
  --severity 2 \
  --action-groups "/subscriptions/{sub}/resourceGroups/rg-monitoring/providers/microsoft.insights/actionGroups/ops-team"

Failure Anomaly Detection (KQL)

// Cascading failures — multiple downstream services failing simultaneously
dependencies
| where timestamp > ago(10m) and success == false
| summarize FailedTargets = dcount(target), Total = count() by cloud_RoleName
| where FailedTargets >= 3

Real-World Debugging Scenarios

Finding the Slow Service

Symptom: Checkout takes 8+ seconds.

// Break down time by dependency
requests
| where name == "POST /api/checkout" and duration > 5000
| join kind=inner (
    dependencies | project operation_Id, target, dep_duration = duration
) on operation_Id
| summarize Avg = avg(dep_duration), P95 = percentile(dep_duration, 95) by target
| order by Avg desc

Resolution: Inventory service averaging 6.2s due to N+1 database queries.

Identifying Intermittent Failures

Symptom: 2% of orders fail randomly.

requests
| where name == "POST /api/orders" and success == false
| join kind=inner (dependencies | where success == false) on operation_Id
| summarize Failures = count() by target, resultCode
| order by Failures desc

Resolution: Payment gateway returning 503 during nightly maintenance. Fixed with retry + exponential backoff.
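
A minimal TypeScript sketch of retry with exponential backoff, along the lines of the fix described above. `withRetry`, `backoffDelayMs`, and the option names are hypothetical. The sketch also adds full jitter, which the resolution does not mention, so that many clients do not retry in lockstep.

```typescript
interface RetryOptions {
  maxAttempts: number; // total tries, including the first
  baseDelayMs: number; // backoff ceiling before the second try
  maxDelayMs: number;  // upper bound on any single wait
}

// Backoff ceiling after the given failed attempt (1-based): base, 2x base, 4x base, ... capped.
function backoffDelayMs(attempt: number, options: RetryOptions): number {
  return Math.min(options.maxDelayMs, options.baseDelayMs * Math.pow(2, attempt - 1));
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Retries only errors the caller classifies as transient (for example, an HTTP 503).
async function withRetry<T>(
  operation: () => Promise<T>,
  isTransient: (error: unknown) => boolean,
  options: RetryOptions,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up on the last attempt or on errors that retrying cannot fix.
      if (attempt >= options.maxAttempts || !isTransient(error)) throw error;
      // Full jitter: wait a random time between 0 and the current ceiling.
      await sleep(Math.random() * backoffDelayMs(attempt, options));
    }
  }
}
```

Each retry appears as a separate dependency call under the same operation_Id, so the failed attempts remain visible in the trace even when the request ultimately succeeds.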

Correlating Frontend to Backend

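// Frontend: enable cross-origin correlation
import { ApplicationInsights } from "@microsoft/applicationinsights-web";

const appInsights = new ApplicationInsights({
  config: {
    connectionString: "...",
    enableCorsCorrelation: true, // send correlation headers on cross-origin calls
    correlationHeaderDomains: ["api.myapp.com"], // only to these domains
  },
});
appInsights.loadAppInsights();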

// Link slow page views to backend operations
pageViews
| where duration > 10000
| join kind=inner (requests) on operation_Id
| project user_Id, frontend_duration = duration, backend_name = name1, backend_duration = duration1

Resolution: Backend returned 200 in 2s but 4MB payload timed out on mobile. Fixed with pagination.

Best Practices

Practice	Rationale
Set `cloud_RoleName` on every service	Required for Application Map
Use W3C Trace Context (SDK default)	Cross-vendor standard
Add business IDs as custom dimensions	Query by OrderId, CustomerId
Never sample exceptions	Always need full error context
Exclude health checks from telemetry	Reduces noise
Use OpenTelemetry for new projects	Vendor-neutral, future-proof
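
Excluding health checks usually comes down to a predicate that a telemetry processor can call before an item is sent. A TypeScript sketch, assuming probe endpoints at the hypothetical paths listed in `HEALTH_PATHS`:

```typescript
// Hypothetical probe paths; adjust to whatever the load balancer or orchestrator calls.
const HEALTH_PATHS = ["/health", "/healthz", "/ready", "/live"];

// True when the request URL (absolute or relative) targets one of the probe paths.
function isHealthCheck(url: string): boolean {
  const path = url
    .replace(/^[a-z][a-z0-9+.-]*:\/\/[^/]+/i, "") // drop scheme and host
    .replace(/[?#].*$/, "")                        // drop query string and fragment
    .replace(/\/+$/, "")                           // drop trailing slashes
    .toLowerCase();
  return HEALTH_PATHS.indexOf(path) !== -1;
}
```

With the Node.js SDK, a processor registered through `addTelemetryProcessor` could return false for such requests to drop them. Matching on the exact path rather than a substring avoids discarding business endpoints that merely contain the word "health".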

Infrastructure Setup

# Create Application Insights with Log Analytics workspace
az monitor app-insights component create \
  --app "my-app-insights" \
  --location "eastus" \
  --resource-group "rg-monitoring" \
  --kind "web" \
  --workspace "/subscriptions/{sub}/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/my-workspace"

# Get connection string
az monitor app-insights component show \
  --app "my-app-insights" \
  --resource-group "rg-monitoring" \
  --query "connectionString" -o tsv

# Set daily cap (in GB per day) for cost control
az monitor app-insights component billing update \
  --app "my-app-insights" \
  --resource-group "rg-monitoring" \
  --cap 5

Common Pitfalls

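Broken correlation in background work: Activity.Current flows through async/await and Task.Run, but not to work handed off through an in-memory queue or channel and picked up by a long-running worker. Capture the trace context when enqueuing and restore it in the worker.
Missing Diagnostic-Id: Older Service Bus SDKs don't auto-propagate. Use Azure.Messaging.ServiceBus v7+.
No cloud_RoleName: Application Map collapses every service into a single node instead of showing a service graph.
Logging PII in dimensions: Custom dimensions are stored in plain text and are readable by anyone with query access, so redact sensitive values before tracking them.
100% sampling in production: Ingesting every item gets expensive at scale. Use adaptive sampling for high-traffic services.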
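
Since custom dimensions are stored as plain text, it helps to scrub properties before they are attached to telemetry. The TypeScript sketch below masks values whose key matches a deny-list. The pattern in `SENSITIVE_KEY` and the function name are assumptions for illustration; a real list would come from the team's data classification.

```typescript
// Hypothetical deny-list: any property whose name contains one of these words is masked.
const SENSITIVE_KEY = /email|phone|password|token|ssn/i;

// Returns a copy of the properties with sensitive values replaced; the input is not modified.
function redactDimensions(props: Record<string, string>): Record<string, string> {
  const safe: Record<string, string> = {};
  for (const key of Object.keys(props)) {
    safe[key] = SENSITIVE_KEY.test(key) ? "[redacted]" : props[key];
  }
  return safe;
}
```

Keeping the key while masking the value preserves the ability to query for "items that had an email attached" without exposing the address itself.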

Summary

Capability	Implementation
Auto-correlation	App Insights SDK + W3C Trace Context
Cross-service tracing	`traceparent` (HTTP), `Diagnostic-Id` (Service Bus)
Business context	Custom dimensions via TelemetryInitializer
Query & analysis	KQL across requests, dependencies, traces, exceptions
Alerting	Scheduled queries + Smart Detection
Cost control	Adaptive sampling + daily caps

Start with the SDK for automatic correlation, add business context through custom dimensions, query with KQL, and alert on anomalies. Together these turn "something is slow" into "the inventory service SQL query is missing an index."