
Azure Monitor — Distributed Tracing and Correlation IDs

Implementing distributed tracing across services, correlation ID propagation, and building end-to-end request visibility.

Why Distributed Tracing Matters

A single user request in a microservices architecture traverses multiple services. Without distributed tracing, you cannot follow that request across service boundaries, and diagnosis suffers:

| Scenario | Without Tracing | With Tracing |
|---|---|---|
| Production incident | Hours of log correlation | Minutes to pinpoint |
| Performance regression | Guess-and-check | Immediate bottleneck ID |
| Dependency failure | Unclear blast radius | Exact impact mapping |

Architecture: Trace Propagation

┌───────────────────────────────────────────────────────────────────┐
│                    DISTRIBUTED TRACE FLOW                         │
│                                                                   │
│  traceparent: 00-<trace-id>-<span-id>-01                          │
│                                                                   │
│  ┌──────────┐   HTTP    ┌──────────┐  Service Bus  ┌──────────┐   │
│  │ Frontend │──────────▶│ Order API│──────────────▶│ Payment  │   │
│  │  (SPA)   │           │  (.NET)  │               │ (Node.js)│   │
│  └──────────┘           └────┬─────┘               └──────────┘   │
│                              │ HTTP                               │
│                         ┌────▼─────┐                              │
│                         │Inventory │                              │
│                         │ Service  │                              │
│                         └────┬─────┘                              │
│                              │ Event Grid                         │
│                         ┌────▼─────┐                              │
│                         │  Audit   │                              │
│                         │ Service  │                              │
│                         └──────────┘                              │
│                                                                   │
│  ┌─────────────────────────────────────────────────────────────┐  │
│  │           Azure Monitor / Application Insights              │  │
│  │  requests | dependencies | traces | exceptions              │  │
│  │  All correlated by operation_Id (trace-id)                  │  │
│  └─────────────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────────────┘

Understanding W3C Trace Context

The traceparent Header

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ──  ────────────────────────────────  ────────────────  ──
          version         trace-id                    span-id      flags
                       (operation_Id)              (parent-id)   (sampled)

The tracestate header carries vendor-specific data (e.g., azure=<correlation-data>).
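
To make the header layout concrete, here is a minimal sketch of a parser for the four fields shown above. The `parseTraceparent` helper and the `TraceParent` type are illustrative names, not part of any SDK, and the sketch only handles the version `00` layout.

```typescript
// Parsed fields of a W3C traceparent header (version 00 layout).
interface TraceParent {
  version: string;
  traceId: string;  // 32 hex chars; becomes operation_Id in Application Insights
  spanId: string;   // 16 hex chars; the caller's span, i.e. the parent of the next span
  sampled: boolean; // lowest bit of the flags byte
}

// Hypothetical helper: returns null for anything that is not a valid header.
function parseTraceparent(header: string): TraceParent | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // The spec reserves version ff and forbids all-zero trace and span IDs.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

const parsed = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(parsed);
```

In practice the SDKs do this parsing for you; a helper like this is mainly useful when debugging headers by hand or handling a transport the SDK does not instrument.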

Application Insights Correlation Model

| Concept | W3C Field | App Insights Field | Purpose |
|---|---|---|---|
| Operation ID | trace-id | operation_Id | Groups all telemetry for one transaction |
| Parent ID | parent span-id | operation_ParentId | Links child to parent span |
| Request ID | span-id | id | Unique ID for this operation |

Auto-Correlation Flow

  1. Inbound — SDK reads traceparent, extracts trace-id as operation_Id
  2. Processing — Creates new span-id, sets parent-id to incoming span-id
  3. Outbound — Injects updated traceparent into outgoing HTTP/messaging calls
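
The three steps above can be sketched as two small functions. This is an illustration of the flow under simplifying assumptions (a well-formed inbound header, no tracestate handling), not the SDK's actual implementation; `SpanContext`, `startChildSpan`, and `toOutboundHeader` are hypothetical names.

```typescript
import { randomBytes } from "node:crypto";

// Correlation fields in Application Insights terms.
interface SpanContext {
  operationId: string;       // trace-id, shared by every span in the transaction
  operationParentId: string; // span-id of the caller
  id: string;                // span-id created for this hop
}

// Steps 1 and 2: read the inbound header and open a child span.
function startChildSpan(inbound: string): SpanContext {
  const parts = inbound.split("-");
  if (parts.length !== 4) throw new Error("malformed traceparent");
  return {
    operationId: parts[1],
    operationParentId: parts[2],
    id: randomBytes(8).toString("hex"), // new 16-hex-char span-id
  };
}

// Step 3: the outgoing header keeps the trace-id but carries this hop's span-id.
function toOutboundHeader(span: SpanContext, sampled = true): string {
  return `00-${span.operationId}-${span.id}-${sampled ? "01" : "00"}`;
}

const span = startChildSpan("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(toOutboundHeader(span));
```

The trace-id never changes along the chain, which is why a single operation_Id filter returns every hop of the transaction.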

Step-by-Step Implementation

Application Insights SDK (.NET)

dotnet add package Microsoft.ApplicationInsights.AspNetCore

// Program.cs
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});
var app = builder.Build();

Application Insights SDK (Node.js)

npm install applicationinsights

// instrumentation.ts (import FIRST in entry point)
import * as appInsights from "applicationinsights";

appInsights.setup(process.env.APPLICATIONINSIGHTS_CONNECTION_STRING)
  .setAutoCollectRequests(true)
  .setAutoCollectDependencies(true)
  .setAutoCollectExceptions(true)
  .setDistributedTracingMode(appInsights.DistributedTracingModes.AI_AND_W3C)
  .start();

export const telemetryClient = appInsights.defaultClient;

OpenTelemetry Integration

dotnet add package Azure.Monitor.OpenTelemetry.AspNetCore

// .NET: OpenTelemetry with Azure Monitor exporter
using Azure.Monitor.OpenTelemetry.AspNetCore;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddOpenTelemetry().UseAzureMonitor(options =>
{
    options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});

// Node.js: OpenTelemetry
import { useAzureMonitor } from "@azure/monitor-opentelemetry";

useAzureMonitor({
  azureMonitorExporterOptions: {
    connectionString: process.env.APPLICATIONINSIGHTS_CONNECTION_STRING,
  },
});

Custom Telemetry with TelemetryClient

// IOrderRepository stands in for whatever data-access abstraction the service uses.
public class OrderService(TelemetryClient telemetry, IOrderRepository repository)
{
    public async Task<Order> ProcessOrder(string orderId)
    {
        // Telemetry tracked inside this operation shares its operation_Id
        using var operation = telemetry.StartOperation<RequestTelemetry>("ProcessOrder");
        operation.Telemetry.Properties["OrderId"] = orderId;
        try
        {
            var order = await repository.GetAsync(orderId);
            telemetry.TrackEvent("OrderProcessed", new Dictionary<string, string>
            {
                ["OrderId"] = orderId, ["Amount"] = order.Total.ToString()
            });
            return order;
        }
        catch (Exception ex)
        {
            // Record the failure on the operation, then let the caller handle it
            telemetry.TrackException(ex);
            operation.Telemetry.Success = false;
            throw;
        }
    }
}

Propagating Correlation Across Messaging

HTTP correlation is automatic. Messaging needs more care: the Service Bus SDK propagates the trace context for you, while Event Grid events must carry it in their payload:

// Service Bus — SDK auto-propagates Diagnostic-Id in ApplicationProperties
var sender = serviceBusClient.CreateSender("orders");
await sender.SendMessageAsync(new ServiceBusMessage(JsonSerializer.Serialize(order))
{
    ApplicationProperties = { ["OrderId"] = order.Id }
});
// Diagnostic-Id is automatically added by Azure.Messaging.ServiceBus

// Event Grid: embed trace context in event data
var eventData = new { Order = order, TraceId = Activity.Current?.TraceId.ToString() };
await eventGridClient.SendEventAsync(
    new EventGridEvent("orders/created", "Order.Created", "1.0", eventData));
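
The Event Grid example above embeds only the trace ID. That is enough to group telemetry under one operation_Id, but it does not record which span published the event. One possible variation, sketched here with hypothetical names (`TracedEvent`, `wrapWithTrace`, `readTrace`), carries the whole traceparent value so the consumer can recover the parent span as well.

```typescript
// Hypothetical event payload: business data plus the full trace context.
interface TracedEvent<T> {
  data: T;
  traceparent: string; // whole header, so the consumer can recover the parent span too
}

// Publisher side: attach the current trace-id and span-id to the payload.
function wrapWithTrace<T>(data: T, traceId: string, spanId: string): TracedEvent<T> {
  return { data, traceparent: `00-${traceId}-${spanId}-01` };
}

// Consumer side: recover the correlation fields, or null if the value is malformed.
function readTrace(evt: TracedEvent<unknown>): { operationId: string; parentId: string } | null {
  const parts = evt.traceparent.split("-");
  if (parts.length !== 4) return null;
  return { operationId: parts[1], parentId: parts[2] };
}

// Round trip through JSON, as the payload would travel inside an event.
const wire = JSON.stringify(
  wrapWithTrace({ orderId: "ORD-2024-001" }, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"),
);
console.log(readTrace(JSON.parse(wire) as TracedEvent<unknown>));
```

On the consumer, the two recovered values map to the operationId and parentOperationId arguments of StartOperation.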

Advanced Correlation Patterns

Service Bus Consumer Correlation

public async Task ProcessMessageAsync(ProcessMessageEventArgs args)
{
    // Azure.Messaging.ServiceBus v7+ auto-links via Diagnostic-Id
    // For manual control, parse the value once: 00-<trace-id>-<span-id>-<flags>
    string? traceId = null, parentSpanId = null;
    if (args.Message.ApplicationProperties.TryGetValue("Diagnostic-Id", out var id)
        && id?.ToString()?.Split('-') is { Length: 4 } parts)
    {
        traceId = parts[1];
        parentSpanId = parts[2];
    }

    using var operation = _telemetry.StartOperation<RequestTelemetry>(
        "ProcessOrderMessage",
        operationId: traceId,
        parentOperationId: parentSpanId);

    // TryGetValue avoids a KeyNotFoundException when the property is absent
    if (args.Message.ApplicationProperties.TryGetValue("OrderId", out var orderId))
        operation.Telemetry.Properties["OrderId"] = orderId?.ToString();

    await HandleOrder(args.Message);
}

Event Grid Consumer Correlation

[Function("HandleOrderEvent")]
public async Task Run([EventGridTrigger] EventGridEvent evt)
{
    var data = evt.Data.ToObjectFromJson<OrderEventData>();
    using var operation = _telemetry.StartOperation<RequestTelemetry>(
        "HandleOrderEvent",
        operationId: data.TraceId);

    operation.Telemetry.Properties["EventType"] = evt.EventType;
    await ProcessEvent(data);
}

Background Job Correlation

protected override async Task ExecuteAsync(CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        using var operation = _telemetry.StartOperation<RequestTelemetry>("InventorySync");
        operation.Telemetry.Properties["JobType"] = "Scheduled";
        try
        {
            var count = await SyncInventory(ct);
            operation.Telemetry.Metrics["ItemsSynced"] = count;
        }
        catch (Exception ex)
        {
            _telemetry.TrackException(ex);
            operation.Telemetry.Success = false;
        }
        await Task.Delay(TimeSpan.FromMinutes(5), ct);
    }
}

Custom Multi-Step Operation Tracking

public async Task<OrderResult> PlaceOrder(OrderRequest request)
{
    using var operation = _telemetry.StartOperation<RequestTelemetry>("PlaceOrder");
    operation.Telemetry.Properties["CustomerId"] = request.CustomerId;

    using (_telemetry.StartOperation<DependencyTelemetry>("ValidateOrder"))
        await _validator.Validate(request);

    using (var dep = _telemetry.StartOperation<DependencyTelemetry>("ReserveInventory"))
    {
        dep.Telemetry.Target = "inventory-service";
        await _inventory.Reserve(request.Items);
    }

    using (var dep = _telemetry.StartOperation<DependencyTelemetry>("ProcessPayment"))
    {
        dep.Telemetry.Target = "payment-gateway";
        await _payment.Charge(request.PaymentMethod, request.Total);
    }
    return new OrderResult { Success = true };
}

KQL Queries for Distributed Tracing

End-to-End Transaction Search

// All telemetry for a specific transaction
let opId = "4bf92f3577b34da6a3ce929d0e0e4736";
union requests, dependencies, traces, exceptions
| where operation_Id == opId
| project timestamp, itemType, name, duration, success, customDimensions
| order by timestamp asc

// Find transactions by business context
union requests, dependencies, traces
| where customDimensions.OrderId == "ORD-2024-001"
| order by timestamp asc

Dependency Chain Analysis

// Full dependency chain for slow requests
requests
| where name == "POST /api/orders" and duration > 3000
| take 10
| join kind=inner (
    dependencies | project operation_Id, dep = name, target, dep_duration = duration
) on operation_Id
| summarize AvgTime = avg(dep_duration), Calls = count() by target, dep
| order by AvgTime desc

Latency Breakdown by Service

dependencies
| where timestamp > ago(1h)
| summarize P50 = percentile(duration, 50), P95 = percentile(duration, 95),
    P99 = percentile(duration, 99) by target, name
| order by P95 desc

Failed Request Correlation

requests
| where success == false and timestamp > ago(1h)
| join kind=inner (
    dependencies | where success == false
    | project operation_Id, failed_dep = name, dep_target = target, resultCode
) on operation_Id
| summarize Failures = count() by failed_dep, dep_target, resultCode
| order by Failures desc

Application Map Data

dependencies
| where timestamp > ago(1h)
| summarize Calls = count(), Failures = countif(success == false),
    AvgMs = avg(duration) by source = cloud_RoleName, target, type
| extend FailRate = round(100.0 * Failures / Calls, 1)
| where FailRate > 5 or AvgMs > 2000

Custom Dimensions and Metrics

Business Context via TelemetryInitializer

public class BusinessContextInitializer : ITelemetryInitializer
{
    private readonly IHttpContextAccessor _http;
    public BusinessContextInitializer(IHttpContextAccessor http) => _http = http;

    public void Initialize(ITelemetry telemetry)
    {
        var ctx = _http.HttpContext;
        if (ctx == null) return;
        var props = (telemetry as ISupportProperties)?.Properties;
        if (props == null) return;

        if (ctx.Request.RouteValues.TryGetValue("orderId", out var oid))
            props["OrderId"] = oid?.ToString();
        var cid = ctx.User?.FindFirst("customer_id")?.Value;
        if (cid != null) props["CustomerId"] = cid;
    }
}

// Register: builder.Services.AddSingleton<ITelemetryInitializer, BusinessContextInitializer>();

Custom Metrics for SLA Tracking

public void TrackOrderLatency(TimeSpan duration)
{
    _telemetry.GetMetric("OrderProcessingDuration").TrackValue(duration.TotalMilliseconds);
    var slaThreshold = TimeSpan.FromSeconds(5);
    _telemetry.GetMetric("OrderSlaCompliance").TrackValue(duration <= slaThreshold ? 1 : 0);
}
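
Because the compliance metric records 1 or 0 per order, its average over a time window is the share of orders that met the SLA. The sketch below shows that arithmetic in isolation; `slaCompliancePercent` is a hypothetical helper, and treating an empty window as fully compliant is an assumption.

```typescript
// Hypothetical helper: share of orders processed within the SLA threshold.
function slaCompliancePercent(durationsMs: number[], thresholdMs = 5000): number {
  if (durationsMs.length === 0) return 100; // assumption: no traffic counts as compliant
  const within = durationsMs.filter((d) => d <= thresholdMs).length;
  return Math.round((1000 * within) / durationsMs.length) / 10; // one decimal place
}

// Three of the four orders finished within 5 seconds.
console.log(slaCompliancePercent([1200, 4800, 5000, 7300])); // 75
```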

Sampling Strategies

// Adaptive (on by default, targeting ~5 items/sec); this raises the target to 20 items/sec
builder.Services.Configure<TelemetryConfiguration>(config =>
{
    config.DefaultTelemetrySink.TelemetryProcessorChainBuilder
        .UseAdaptiveSampling(maxTelemetryItemsPerSecond: 20, excludedTypes: "Exception;Event");
});

// Fixed-rate — predictable volume
builder.Services.Configure<TelemetryConfiguration>(config =>
{
    config.DefaultTelemetrySink.TelemetryProcessorChainBuilder
        .UseSampling(25); // 25% of telemetry
});

# Ingestion sampling via CLI
az monitor app-insights component update \
  --app "my-app-insights" \
  --resource-group "rg-monitoring" \
  --sampling-percentage 50
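
Sampling only preserves end-to-end traces if every service makes the same keep-or-drop decision for a given transaction. A common way to achieve that is to derive the decision from a hash of the trace ID. The sketch below illustrates the idea with an FNV-1a hash; it is explicitly not the algorithm the Application Insights SDK uses, and `sampleScore` and `shouldKeep` are hypothetical names.

```typescript
// Hypothetical hash-based sampler; NOT the algorithm the Application Insights SDK uses.
function sampleScore(traceId: string): number {
  // FNV-1a 32-bit hash, mapped onto the range [0, 100).
  let hash = 0x811c9dc5;
  for (let i = 0; i < traceId.length; i++) {
    hash ^= traceId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return (hash / 0x100000000) * 100;
}

// Same trace ID and percentage always give the same answer, on every service.
function shouldKeep(traceId: string, samplingPercentage: number): boolean {
  return sampleScore(traceId) < samplingPercentage;
}

console.log(shouldKeep("4bf92f3577b34da6a3ce929d0e0e4736", 25));
```

Because the score depends only on the trace ID, a transaction kept at 25% is also kept at any higher percentage, so services with different rates still agree on the traces that the strictest one retains.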

Alerting on Trace Data

Smart Detection

Enabled by default. Automatically detects abnormal failure rates, performance degradation, and dependency issues.

Custom Alert Rules

# Alert: more than 50 failed requests in a 5-minute window
az monitor scheduled-query create \
  --name "HighErrorRate-OrderService" \
  --resource-group "rg-monitoring" \
  --scopes "/subscriptions/{sub}/resourceGroups/rg-monitoring/providers/microsoft.insights/components/my-app-insights" \
  --condition "count 'FailedRequests' > 50" \
  --condition-query FailedRequests="requests | where cloud_RoleName == 'order-service' | where success == false | where timestamp > ago(5m)" \
  --evaluation-frequency "5m" \
  --window-size "5m" \
  --severity 2 \
  --action-groups "/subscriptions/{sub}/resourceGroups/rg-monitoring/providers/microsoft.insights/actionGroups/ops-team"

Failure Anomaly Detection (KQL)

// Cascading failures — multiple downstream services failing simultaneously
dependencies
| where timestamp > ago(10m) and success == false
| summarize FailedTargets = dcount(target), Total = count() by cloud_RoleName
| where FailedTargets >= 3

Real-World Debugging Scenarios

Finding the Slow Service

Symptom: Checkout takes 8+ seconds.

// Break down time by dependency
requests
| where name == "POST /api/checkout" and duration > 5000
| join kind=inner (
    dependencies | project operation_Id, target, dep_duration = duration
) on operation_Id
| summarize Avg = avg(dep_duration), P95 = percentile(dep_duration, 95) by target
| order by Avg desc

Resolution: Inventory service averaging 6.2s due to N+1 database queries.

Identifying Intermittent Failures

Symptom: 2% of orders fail randomly.

requests
| where name == "POST /api/orders" and success == false
| join kind=inner (dependencies | where success == false) on operation_Id
| summarize Failures = count() by target, resultCode
| order by Failures desc

Resolution: Payment gateway returning 503 during nightly maintenance. Fixed with retry + exponential backoff.
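
A retry with exponential backoff, as mentioned in the resolution, can be sketched as follows. The `withRetry` helper, its defaults, and the jitter factor are assumptions for illustration; production code would typically retry only on transient errors such as 503 rather than on every exception.

```typescript
// Hypothetical helper: retries a failing async call, doubling the wait each time.
async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 200,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break; // out of attempts, give up
      // Base delay doubles per attempt, plus up to 50% random jitter.
      const delay = baseDelayMs * 2 ** attempt * (1 + Math.random() * 0.5);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Each retried call shows up as a separate dependency record under the same operation_Id, so the retries remain visible in the transaction view.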

Correlating Frontend to Backend

// Frontend: enable cross-origin correlation
import { ApplicationInsights } from "@microsoft/applicationinsights-web";

const appInsights = new ApplicationInsights({
  config: {
    connectionString: "...",
    enableCorsCorrelation: true,
    correlationHeaderDomains: ["api.myapp.com"],
  },
});
appInsights.loadAppInsights();

// Link slow page views to backend operations
pageViews
| where duration > 10000
| join kind=inner (requests) on operation_Id
| project user_Id, frontend_duration = duration, backend_name = name1, backend_duration = duration1

Resolution: Backend returned 200 in 2s but 4MB payload timed out on mobile. Fixed with pagination.


Best Practices

| Practice | Rationale |
|---|---|
| Set cloud_RoleName on every service | Required for Application Map |
| Use W3C Trace Context (SDK default) | Cross-vendor standard |
| Add business IDs as custom dimensions | Query by OrderId, CustomerId |
| Exclude exceptions from sampling | Full error context is always needed |
| Exclude health checks from telemetry | Reduces noise |
| Use OpenTelemetry for new projects | Vendor-neutral, future-proof |
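
As an illustration of excluding health checks, the sketch below shows a filter predicate that drops request telemetry for probe endpoints. The `RequestEnvelope` shape is a simplified assumption about what the telemetry envelope looks like, and the probe paths are made up. In the classic applicationinsights Node.js SDK, a predicate of this kind can be registered with the client's addTelemetryProcessor method.

```typescript
// Minimal shape of the fields this filter reads; the real SDK envelope has more.
interface RequestEnvelope {
  data?: { baseType?: string; baseData?: { name?: string; url?: string } };
}

// Assumed health-probe paths; adjust to whatever your services expose.
const HEALTH_PATHS = ["/health", "/healthz", "/ready"];

// Returns false to drop the item, true to keep it.
function keepTelemetry(envelope: RequestEnvelope): boolean {
  if (envelope.data?.baseType !== "RequestData") return true; // only filter requests
  const url = envelope.data.baseData?.url ?? "";
  let path = url;
  try {
    // Accepts absolute and relative URLs; the base is only a parsing placeholder.
    path = new URL(url, "http://placeholder").pathname;
  } catch {
    // Unparseable URL: fall back to comparing the raw string.
  }
  return !HEALTH_PATHS.includes(path);
}

console.log(keepTelemetry({ data: { baseType: "RequestData", baseData: { url: "/health" } } }));
```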

Infrastructure Setup

# Create Application Insights with Log Analytics workspace
az monitor app-insights component create \
  --app "my-app-insights" \
  --location "eastus" \
  --resource-group "rg-monitoring" \
  --kind "web" \
  --workspace "/subscriptions/{sub}/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/my-workspace"

# Get connection string
az monitor app-insights component show \
  --app "my-app-insights" \
  --resource-group "rg-monitoring" \
  --query "connectionString" -o tsv

# Set daily cap (in GB per day) for cost control
az monitor app-insights component billing update \
  --app "my-app-insights" \
  --resource-group "rg-monitoring" \
  --cap 5

Common Pitfalls

  1. Broken correlation in async code: fire-and-forget work started with Task.Run can outlive the request's Activity, leaving its telemetry unparented. Use async/await throughout.
  2. Missing Diagnostic-Id: older Service Bus SDKs don't auto-propagate. Use v7+.
  3. No cloud_RoleName: Application Map shows one blob instead of a service graph.
  4. Logging PII in dimensions: custom dimensions are stored in plain text.
  5. 100% sampling in production: use adaptive sampling for high-traffic services.
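
For the PII pitfall, one option is to scrub custom dimensions before they are attached to telemetry. The sketch below replaces values of sensitive keys with a truncated SHA-256 digest, so records stay joinable without exposing the raw value. The key list and the `scrubDimensions` helper are assumptions; note that an unsalted hash of a guessable value such as an email address can still be reversed by brute force, so a keyed hash (HMAC) would be the stronger choice.

```typescript
import { createHash } from "node:crypto";

// Assumed list of sensitive keys (lowercase); adjust to your own data model.
const SENSITIVE_KEYS = new Set(["email", "phone", "customername"]);

// Replaces sensitive values with a short digest; other values pass through unchanged.
function scrubDimensions(props: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [key, value] of Object.entries(props)) {
    out[key] = SENSITIVE_KEYS.has(key.toLowerCase())
      ? createHash("sha256").update(value).digest("hex").slice(0, 16)
      : value;
  }
  return out;
}

console.log(scrubDimensions({ OrderId: "ORD-2024-001", Email: "jane@example.com" }));
```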

Summary

| Capability | Implementation |
|---|---|
| Auto-correlation | App Insights SDK + W3C Trace Context |
| Cross-service tracing | traceparent (HTTP), Diagnostic-Id (Service Bus) |
| Business context | Custom dimensions via TelemetryInitializer |
| Query & analysis | KQL across requests, dependencies, traces, exceptions |
| Alerting | Scheduled queries + Smart Detection |
| Cost control | Adaptive sampling + daily caps |

Start with the SDK for automatic correlation, add business context through custom dimensions, query with KQL for insights, and alert on anomalies. The combination gives you complete visibility — turning "something is slow" into "the inventory service SQL query is missing an index."