Azure Monitor — Distributed Tracing and Correlation IDs
Why Distributed Tracing Matters
A single user request in a microservices architecture traverses multiple services. Without distributed tracing you cannot:
- Debug failures — When Service D fails, trace back through A → B → C to find root cause
- Analyze latency — Break down a 3-second request into time spent at each hop
- Map dependencies — Discover actual service communication patterns
| Scenario | Without Tracing | With Tracing |
|---|---|---|
| Production incident | Hours of log correlation | Minutes to pinpoint |
| Performance regression | Guess-and-check | Immediate bottleneck ID |
| Dependency failure | Unclear blast radius | Exact impact mapping |
Architecture: Trace Propagation
┌───────────────────────────────────────────────────────────────────┐
│                      DISTRIBUTED TRACE FLOW                       │
│                                                                   │
│  traceparent: 00-<trace-id>-<span-id>-01                          │
│                                                                   │
│  ┌──────────┐   HTTP    ┌──────────┐  Service Bus  ┌──────────┐   │
│  │ Frontend │──────────▶│ Order API│──────────────▶│ Payment  │   │
│  │  (SPA)   │           │  (.NET)  │               │ (Node.js)│   │
│  └──────────┘           └────┬─────┘               └──────────┘   │
│                              │ HTTP                               │
│                         ┌────▼─────┐                              │
│                         │Inventory │                              │
│                         │ Service  │                              │
│                         └────┬─────┘                              │
│                              │ Event Grid                         │
│                         ┌────▼─────┐                              │
│                         │  Audit   │                              │
│                         │ Service  │                              │
│                         └──────────┘                              │
│                                                                   │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │  Azure Monitor / Application Insights                         │ │
│ │  requests | dependencies | traces | exceptions                │ │
│ │  All correlated by operation_Id (trace-id)                    │ │
│ └───────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────┘
Understanding W3C Trace Context
The traceparent Header
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             ── ──────────────────────────────── ──────────────── ──
        version trace-id                         span-id          flags
                (operation_Id)                   (parent-id)      (sampled)
The tracestate header carries vendor-specific data (e.g., azure=<correlation-data>).
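To make the header layout concrete, here is a minimal sketch of a parser for version-00 `traceparent` values. The names `TraceParent` and `parseTraceParent` are illustrative, not part of any SDK; the Application Insights and OpenTelemetry SDKs do this parsing for you.

```typescript
// Parsed form of a W3C traceparent header (version 00 layout).
interface TraceParent {
  version: string;
  traceId: string;  // 32 hex chars; surfaces as operation_Id
  spanId: string;   // 16 hex chars; the caller's span (parent-id)
  sampled: boolean; // lowest bit of the flags byte
}

// Returns null for malformed headers instead of throwing, so a bad
// header can be treated as "no incoming context".
function parseTraceParent(header: string): TraceParent | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (match === null) return null;
  const [, version = "", traceId = "", spanId = "", flags = ""] = match;
  // The spec reserves version ff and forbids all-zero trace and span IDs.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

console.log(parseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"));
```

Returning `null` rather than throwing is a design choice in this sketch: a request with a corrupt header should still be served, just as the start of a new trace.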
Application Insights Correlation Model
| Concept | W3C Field | App Insights Field | Purpose |
|---|---|---|---|
| Operation ID | trace-id | operation_Id | Groups all telemetry for one transaction |
| Parent ID | parent span-id | operation_ParentId | Links child to parent span |
| Request ID | span-id | id | Unique ID for this operation |
Auto-Correlation Flow
- Inbound — SDK reads traceparent, extracts trace-id as operation_Id
- Processing — Creates new span-id, sets parent-id to incoming span-id
- Outbound — Injects updated traceparent into outgoing HTTP/messaging calls
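The three steps above can be sketched in a few lines. This is an illustration of the flow, not the SDK's implementation; `SpanContext`, `startSpan`, `toTraceParent`, and `randomHex` are hypothetical names.

```typescript
interface SpanContext {
  traceId: string;       // shared by every hop; surfaces as operation_Id
  spanId: string;        // this service's own span
  parentSpanId?: string; // the caller's span; surfaces as operation_ParentId
  flags: string;
}

// Random lowercase hex string. Math.random is enough for a sketch;
// real SDKs use stronger ID generators.
function randomHex(chars: number): string {
  let out = "";
  for (let i = 0; i < chars; i++) out += Math.floor(Math.random() * 16).toString(16);
  return out;
}

// Inbound + processing: keep the trace-id, mint a new span-id, and record
// the caller's span-id as the parent. A missing or malformed header
// starts a new trace.
function startSpan(incomingTraceParent?: string): SpanContext {
  const fields = (incomingTraceParent ?? "").split("-");
  const [, traceId = "", parentSpanId = "", flags = ""] = fields;
  if (fields.length === 4 && traceId.length === 32 && parentSpanId.length === 16) {
    return { traceId, spanId: randomHex(16), parentSpanId, flags };
  }
  return { traceId: randomHex(32), spanId: randomHex(16), flags: "01" };
}

// Outbound: the header sent downstream carries this service's span-id.
function toTraceParent(ctx: SpanContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.flags}`;
}

const inboundCtx = startSpan("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(toTraceParent(inboundCtx)); // same trace-id, new span-id
```

The trace-id never changes along the chain, which is why a single `operation_Id` filter can retrieve telemetry from every service involved.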
Step-by-Step Implementation
Application Insights SDK (.NET)
dotnet add package Microsoft.ApplicationInsights.AspNetCore
// Program.cs
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddApplicationInsightsTelemetry(options =>
{
options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});
var app = builder.Build();
Application Insights SDK (Node.js)
npm install applicationinsights
// instrumentation.ts — import FIRST in entry point
import * as appInsights from "applicationinsights";
appInsights.setup(process.env.APPLICATIONINSIGHTS_CONNECTION_STRING)
.setAutoCollectRequests(true)
.setAutoCollectDependencies(true)
.setAutoCollectExceptions(true)
.setDistributedTracingMode(appInsights.DistributedTracingModes.AI_AND_W3C)
.start();
export const telemetryClient = appInsights.defaultClient;
OpenTelemetry Integration
dotnet add package Azure.Monitor.OpenTelemetry.AspNetCore
// .NET — OpenTelemetry with Azure Monitor exporter
using Azure.Monitor.OpenTelemetry.AspNetCore;
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddOpenTelemetry().UseAzureMonitor(options =>
{
options.ConnectionString = builder.Configuration["ApplicationInsights:ConnectionString"];
});
// Node.js — OpenTelemetry
import { useAzureMonitor } from "@azure/monitor-opentelemetry";
useAzureMonitor({
azureMonitorExporterOptions: {
connectionString: process.env.APPLICATIONINSIGHTS_CONNECTION_STRING,
},
});
Custom Telemetry with TelemetryClient
// IOrderRepository is a placeholder for your own data-access abstraction.
public class OrderService(TelemetryClient telemetry, IOrderRepository repository)
{
public async Task<Order> ProcessOrder(string orderId)
{
using var operation = telemetry.StartOperation<RequestTelemetry>("ProcessOrder");
operation.Telemetry.Properties["OrderId"] = orderId;
try
{
var order = await repository.GetAsync(orderId);
telemetry.TrackEvent("OrderProcessed", new Dictionary<string, string>
{
["OrderId"] = orderId, ["Amount"] = order.Total.ToString()
});
return order;
}
catch (Exception ex)
{
telemetry.TrackException(ex);
operation.Telemetry.Success = false;
throw;
}
}
}
Propagating Correlation Across Messaging
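HTTP correlation is automatic. Messaging needs more attention: the Service Bus client adds a Diagnostic-Id property to outgoing messages for you, whereas Event Grid events need the trace context carried explicitly, for example in the event payload: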
// Service Bus — SDK auto-propagates Diagnostic-Id in ApplicationProperties
var sender = serviceBusClient.CreateSender("orders");
await sender.SendMessageAsync(new ServiceBusMessage(JsonSerializer.Serialize(order))
{
ApplicationProperties = { ["OrderId"] = order.Id }
});
// Diagnostic-Id is automatically added by Azure.Messaging.ServiceBus
// Event Grid — embed trace context in event data
var eventData = new { Order = order, TraceId = Activity.Current?.TraceId.ToString() };
await eventGridClient.SendEventAsync(
new EventGridEvent("orders/created", "Order.Created", "1.0", eventData));
Advanced Correlation Patterns
Service Bus Consumer Correlation
public async Task ProcessMessageAsync(ProcessMessageEventArgs args)
{
// Azure.Messaging.ServiceBus v7+ auto-links via Diagnostic-Id
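// For manual control (assumes a W3C-format Diagnostic-Id: 00-<trace-id>-<span-id>-<flags>):
var diagnosticId = args.Message.ApplicationProperties
.TryGetValue("Diagnostic-Id", out var id) ? id?.ToString() : null;
var parts = diagnosticId?.Split('-');
var isW3C = parts is { Length: 4 }; // any other format falls back to a new operation
using var operation = _telemetry.StartOperation<RequestTelemetry>(
"ProcessOrderMessage",
operationId: isW3C ? parts![1] : null,
parentOperationId: isW3C ? parts![2] : null);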
operation.Telemetry.Properties["OrderId"] =
args.Message.ApplicationProperties["OrderId"]?.ToString();
await HandleOrder(args.Message);
}
Event Grid Consumer Correlation
[Function("HandleOrderEvent")]
public async Task Run([EventGridTrigger] EventGridEvent evt)
{
var data = evt.Data.ToObjectFromJson<OrderEventData>();
using var operation = _telemetry.StartOperation<RequestTelemetry>(
"HandleOrderEvent",
operationId: data.TraceId);
operation.Telemetry.Properties["EventType"] = evt.EventType;
await ProcessEvent(data);
}
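The Event Grid examples above carry only the trace ID in the payload, which joins the consumer to the same operation but does not record which span published the event. A possible variation, sketched below, is to carry the full `traceparent` value so the consumer can also recover the parent span. The envelope shape and the names `TracedEnvelope`, `wrapWithTrace`, and `correlationFrom` are assumptions for illustration, not an Event Grid or SDK API.

```typescript
// Hypothetical envelope: business data plus the producer's traceparent.
interface TracedEnvelope<T> {
  data: T;
  traceparent: string;
}

// Producer side: attach the current trace context before publishing.
function wrapWithTrace<T>(data: T, traceId: string, spanId: string): TracedEnvelope<T> {
  return { data, traceparent: `00-${traceId}-${spanId}-01` };
}

// Consumer side: recover the IDs needed to continue the operation.
// Returns an empty object when the envelope carries no usable context.
function correlationFrom(envelope: TracedEnvelope<unknown>): { operationId?: string; parentId?: string } {
  const [, traceId = "", spanId = ""] = envelope.traceparent.split("-");
  if (traceId.length !== 32 || spanId.length !== 16) return {};
  return { operationId: traceId, parentId: spanId };
}

const orderEnvelope = wrapWithTrace(
  { orderId: "ORD-2024-001" },
  "4bf92f3577b34da6a3ce929d0e0e4736",
  "00f067aa0ba902b7",
);
const wirePayload = JSON.stringify(orderEnvelope); // what would travel as event data
console.log(correlationFrom(JSON.parse(wirePayload) as TracedEnvelope<unknown>));
```

On the consumer, `operationId` and `parentId` would map to the `operationId` and `parentOperationId` arguments of `StartOperation`.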
Background Job Correlation
protected override async Task ExecuteAsync(CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        // Scope the operation to the sync itself so the 5-minute wait
        // below is not counted in the recorded duration.
        using (var operation = _telemetry.StartOperation<RequestTelemetry>("InventorySync"))
        {
            operation.Telemetry.Properties["JobType"] = "Scheduled";
            try
            {
                var count = await SyncInventory(ct);
                operation.Telemetry.Metrics["ItemsSynced"] = count;
            }
            catch (Exception ex)
            {
                _telemetry.TrackException(ex);
                operation.Telemetry.Success = false;
            }
        }
        await Task.Delay(TimeSpan.FromMinutes(5), ct);
    }
}
Custom Multi-Step Operation Tracking
public async Task<OrderResult> PlaceOrder(OrderRequest request)
{
using var operation = _telemetry.StartOperation<RequestTelemetry>("PlaceOrder");
operation.Telemetry.Properties["CustomerId"] = request.CustomerId;
using (_telemetry.StartOperation<DependencyTelemetry>("ValidateOrder"))
await _validator.Validate(request);
using (var dep = _telemetry.StartOperation<DependencyTelemetry>("ReserveInventory"))
{
dep.Telemetry.Target = "inventory-service";
await _inventory.Reserve(request.Items);
}
using (var dep = _telemetry.StartOperation<DependencyTelemetry>("ProcessPayment"))
{
dep.Telemetry.Target = "payment-gateway";
await _payment.Charge(request.PaymentMethod, request.Total);
}
return new OrderResult { Success = true };
}
KQL Queries for Distributed Tracing
End-to-End Transaction Search
// All telemetry for a specific transaction
let opId = "4bf92f3577b34da6a3ce929d0e0e4736";
union requests, dependencies, traces, exceptions
| where operation_Id == opId
| project timestamp, itemType, name, duration, success, customDimensions
| order by timestamp asc
// Find transactions by business context
union requests, dependencies, traces
| where tostring(customDimensions.OrderId) == "ORD-2024-001"
| order by timestamp asc
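The rows returned by a transaction search are flat, but each item's `operation_ParentId` points at the `id` of its caller, so the call tree can be rebuilt client-side. The sketch below assumes rows already reduced to a few fields; `TelemetryRow`, `buildSpanTree`, and `renderTree` are illustrative names.

```typescript
// One row from the union of requests and dependencies, reduced to the
// fields needed to rebuild the call tree.
interface TelemetryRow {
  id: string;                 // span-id of this item
  operation_ParentId: string; // id of the caller; unmatched for the root
  label: string;
  durationMs: number;
}

interface SpanNode extends TelemetryRow {
  children: SpanNode[];
}

// Links each row to its parent; rows whose parent is not in the set become roots.
function buildSpanTree(rows: TelemetryRow[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  for (const row of rows) nodes.set(row.id, { ...row, children: [] });
  const roots: SpanNode[] = [];
  nodes.forEach((node) => {
    const parentNode = nodes.get(node.operation_ParentId);
    if (parentNode !== undefined && parentNode !== node) parentNode.children.push(node);
    else roots.push(node);
  });
  return roots;
}

// Renders the tree as indented lines, one per span.
function renderTree(nodes: SpanNode[], depth = 0, lines: string[] = []): string[] {
  for (const n of nodes) {
    lines.push(`${"  ".repeat(depth)}${n.label} (${n.durationMs} ms)`);
    renderTree(n.children, depth + 1, lines);
  }
  return lines;
}

const sampleRows: TelemetryRow[] = [
  { id: "a1", operation_ParentId: "", label: "POST /api/orders", durationMs: 3000 },
  { id: "b2", operation_ParentId: "a1", label: "GET inventory", durationMs: 1200 },
  { id: "c3", operation_ParentId: "b2", label: "SQL query", durationMs: 900 },
];
console.log(renderTree(buildSpanTree(sampleRows)).join("\n"));
```

Treating unmatched parents as roots keeps the sketch usable when sampling or retention has dropped part of a trace.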
Dependency Chain Analysis
// Full dependency chain for slow requests
requests
| where name == "POST /api/orders" and duration > 3000
| take 10
| join kind=inner (
dependencies | project operation_Id, dep = name, target, dep_duration = duration
) on operation_Id
| summarize AvgTime = avg(dep_duration), Calls = count() by target, dep
| order by AvgTime desc
Latency Breakdown by Service
dependencies
| where timestamp > ago(1h)
| summarize P50 = percentile(duration, 50), P95 = percentile(duration, 95),
P99 = percentile(duration, 99) by target, name
| order by P95 desc
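To show what the percentile summary computes, here is a small sketch that groups dependency durations by target and reports P50 and P95. It uses the exact nearest-rank method, whereas KQL's `percentile()` returns an estimate, so results on real data may differ slightly. `DependencyCall`, `nearestRankPercentile`, and `latencyByTarget` are illustrative names.

```typescript
interface DependencyCall {
  target: string;
  durationMs: number;
}

// Nearest-rank percentile over a list of durations (p in 0..100).
function nearestRankPercentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] ?? NaN;
}

// Groups calls by target and reports P50/P95, slowest P95 first.
function latencyByTarget(calls: DependencyCall[]): { target: string; p50: number; p95: number }[] {
  const groups = new Map<string, number[]>();
  for (const call of calls) {
    const bucket = groups.get(call.target) ?? [];
    bucket.push(call.durationMs);
    groups.set(call.target, bucket);
  }
  const result: { target: string; p50: number; p95: number }[] = [];
  groups.forEach((durations, target) => {
    result.push({
      target,
      p50: nearestRankPercentile(durations, 50),
      p95: nearestRankPercentile(durations, 95),
    });
  });
  return result.sort((a, b) => b.p95 - a.p95);
}

console.log(latencyByTarget([
  { target: "inventory-service", durationMs: 100 },
  { target: "inventory-service", durationMs: 400 },
  { target: "payment-gateway", durationMs: 50 },
]));
```

Sorting by P95 rather than the average mirrors the query above: tail latency is usually what users notice first.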
Failed Request Correlation
requests
| where success == false and timestamp > ago(1h)
| join kind=inner (
dependencies | where success == false
| project operation_Id, failed_dep = name, dep_target = target, resultCode
) on operation_Id
| summarize Failures = count() by failed_dep, dep_target, resultCode
| order by Failures desc
Application Map Data
dependencies
| where timestamp > ago(1h)
| summarize Calls = count(), Failures = countif(success == false),
AvgMs = avg(duration) by source = cloud_RoleName, target, type
| extend FailRate = round(100.0 * Failures / Calls, 1)
| where FailRate > 5 or AvgMs > 2000
Custom Dimensions and Metrics
Business Context via TelemetryInitializer
public class BusinessContextInitializer : ITelemetryInitializer
{
private readonly IHttpContextAccessor _http;
public BusinessContextInitializer(IHttpContextAccessor http) => _http = http;
public void Initialize(ITelemetry telemetry)
{
var ctx = _http.HttpContext;
if (ctx == null) return;
var props = (telemetry as ISupportProperties)?.Properties;
if (props == null) return;
if (ctx.Request.RouteValues.TryGetValue("orderId", out var oid))
props["OrderId"] = oid?.ToString();
var cid = ctx.User?.FindFirst("customer_id")?.Value;
if (cid != null) props["CustomerId"] = cid;
}
}
// Register (IHttpContextAccessor must be registered as well):
// builder.Services.AddHttpContextAccessor();
// builder.Services.AddSingleton<ITelemetryInitializer, BusinessContextInitializer>();
Custom Metrics for SLA Tracking
public void TrackOrderLatency(TimeSpan duration)
{
_telemetry.GetMetric("OrderProcessingDuration").TrackValue(duration.TotalMilliseconds);
var slaThreshold = TimeSpan.FromSeconds(5);
_telemetry.GetMetric("OrderSlaCompliance").TrackValue(duration <= slaThreshold ? 1 : 0);
}
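Because the compliance metric records 1 for a request inside the SLA and 0 otherwise, its average over a period is the compliant fraction. A short sketch of that arithmetic, with the hypothetical helpers `slaSample` and `slaCompliancePercent`:

```typescript
// Mirrors the 1/0 values tracked above: 1 when within the SLA, else 0.
function slaSample(durationMs: number, thresholdMs: number): 0 | 1 {
  return durationMs <= thresholdMs ? 1 : 0;
}

// Averaging the 1/0 samples gives the compliant fraction, as a percentage.
function slaCompliancePercent(durationsMs: number[], thresholdMs: number): number {
  if (durationsMs.length === 0) return 100; // assumption: no traffic counts as compliant
  let compliant = 0;
  for (const d of durationsMs) compliant += slaSample(d, thresholdMs);
  return (100 * compliant) / durationsMs.length;
}

// Three of the four requests finish within the 5-second threshold.
console.log(slaCompliancePercent([1200, 4800, 5000, 7300], 5000)); // 75
```

How to score a window with no requests is a policy decision; the sketch treats it as compliant, but an alert rule might prefer to treat it as missing data.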
Sampling Strategies
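// Adaptive sampling is on by default and targets ~5 items/sec.
// To tune it, turn off the default (EnableAdaptiveSampling = false in the
// AddApplicationInsightsTelemetry options) and configure it explicitly.
// Here the target is raised to 20 items/sec; exceptions and events are never sampled out.
builder.Services.Configure<TelemetryConfiguration>(config =>
{
    var chain = config.DefaultTelemetrySink.TelemetryProcessorChainBuilder;
    chain.UseAdaptiveSampling(maxTelemetryItemsPerSecond: 20, excludedTypes: "Exception;Event");
    chain.Build(); // processors take effect only after Build()
});
// Fixed-rate: predictable volume (use instead of adaptive sampling, not alongside it)
builder.Services.Configure<TelemetryConfiguration>(config =>
{
    var chain = config.DefaultTelemetrySink.TelemetryProcessorChainBuilder;
    chain.UseSampling(25); // keep 25% of telemetry
    chain.Build();
});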
# Ingestion sampling via CLI
az monitor app-insights component update \
--app "my-app-insights" \
--resource-group "rg-monitoring" \
--sampling-percentage 50
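Sampling only preserves end-to-end traces if every service makes the same keep-or-drop decision for a given operation. One way to get that property is to derive the decision from the trace ID rather than from a random draw per item. The sketch below illustrates the idea with a simple string hash; it is not the hash the Application Insights SDKs use, and `samplingScore` and `shouldSample` are hypothetical names.

```typescript
// Maps a trace-id to a stable score in [0, 100). The hash is a simple
// 32-bit string hash chosen for illustration, not the SDK's own algorithm.
function samplingScore(traceId: string): number {
  let hash = 5381;
  for (let i = 0; i < traceId.length; i++) {
    hash = ((hash << 5) + hash + traceId.charCodeAt(i)) | 0; // hash * 33 + char, kept to 32 bits
  }
  return ((hash >>> 0) / 0x100000000) * 100;
}

// Keeps a trace when its score falls under the configured percentage.
// The same trace-id always yields the same answer, in any service.
function shouldSample(traceId: string, percentage: number): boolean {
  return samplingScore(traceId) < percentage;
}

const exampleTraceId = "4bf92f3577b34da6a3ce929d0e0e4736";
console.log(shouldSample(exampleTraceId, 25), shouldSample(exampleTraceId, 25)); // identical decisions
```

A useful side effect of the score-based approach is that raising the percentage only ever adds traces: anything kept at 25% is also kept at 50%.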
Alerting on Trace Data
Smart Detection
Enabled by default. Automatically detects abnormal failure rates, performance degradation, and dependency issues.
Custom Alert Rules
# Alert: more than 50 failed requests in a 5-minute window
az monitor scheduled-query create \
--name "HighErrorRate-OrderService" \
--resource-group "rg-monitoring" \
--scopes "/subscriptions/{sub}/resourceGroups/rg-monitoring/providers/microsoft.insights/components/my-app-insights" \
--condition "count 'FailedRequests' > 50" \
--condition-query FailedRequests="requests | where cloud_RoleName == 'order-service' | where success == false | where timestamp > ago(5m)" \
--evaluation-frequency "5m" \
--window-size "5m" \
--severity 2 \
--action-groups "/subscriptions/{sub}/resourceGroups/rg-monitoring/providers/microsoft.insights/actionGroups/ops-team"
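A fixed failure count and a failure rate can give different answers for the same window: 60 failures is alarming at 1,000 requests and routine at 100,000. The sketch below evaluates both conditions side by side so the difference is easy to see. The types and the function `evaluateWindow` are illustrative and unrelated to the alerting service's internals.

```typescript
interface RequestOutcome {
  timestampMs: number;
  success: boolean;
}

interface AlertVerdict {
  failures: number;
  errorRatePercent: number;
  countBreached: boolean;
  rateBreached: boolean;
}

// Evaluates one window ending at nowMs against a count threshold
// and a rate threshold.
function evaluateWindow(
  outcomes: RequestOutcome[],
  nowMs: number,
  windowMs: number,
  maxFailures: number,
  maxErrorRatePercent: number,
): AlertVerdict {
  const inWindow = outcomes.filter((o) => o.timestampMs > nowMs - windowMs && o.timestampMs <= nowMs);
  const failures = inWindow.filter((o) => !o.success).length;
  const errorRatePercent = inWindow.length === 0 ? 0 : (100 * failures) / inWindow.length;
  return {
    failures,
    errorRatePercent,
    countBreached: failures > maxFailures,
    rateBreached: errorRatePercent > maxErrorRatePercent,
  };
}

// Two requests in the last five minutes, one failed: 50% error rate, but only one failure.
console.log(evaluateWindow(
  [{ timestampMs: 590000, success: true }, { timestampMs: 595000, success: false }],
  600000, 300000, 50, 5,
));
```

For services with variable traffic, a rate condition (or a rate combined with a minimum request count) tends to be less noisy than a raw count.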
Failure Anomaly Detection (KQL)
// Cascading failures — multiple downstream services failing simultaneously
dependencies
| where timestamp > ago(10m) and success == false
| summarize FailedTargets = dcount(target), Total = count() by cloud_RoleName
| where FailedTargets >= 3
Real-World Debugging Scenarios
Finding the Slow Service
Symptom: Checkout takes 8+ seconds.
// Break down time by dependency
requests
| where name == "POST /api/checkout" and duration > 5000
| join kind=inner (
dependencies | project operation_Id, target, dep_duration = duration
) on operation_Id
| summarize Avg = avg(dep_duration), P95 = percentile(dep_duration, 95) by target
| order by Avg desc
Resolution: Inventory service averaging 6.2s due to N+1 database queries.
Identifying Intermittent Failures
Symptom: 2% of orders fail randomly.
requests
| where name == "POST /api/orders" and success == false
| join kind=inner (dependencies | where success == false) on operation_Id
| summarize Failures = count() by target, resultCode
| order by Failures desc
Resolution: Payment gateway returning 503 during nightly maintenance. Fixed with retry + exponential backoff.
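As a rough illustration of the retry-with-backoff fix, the sketch below computes an exponential delay schedule and wraps an operation with it. The names `backoffDelays` and `withRetry`, and the base and cap values, are assumptions for the example rather than details of the actual fix.

```typescript
// Delay before each retry: base * 2^attempt, capped at maxDelayMs.
function backoffDelays(retries: number, baseDelayMs: number, maxDelayMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(Math.min(maxDelayMs, baseDelayMs * 2 ** attempt));
  }
  return delays;
}

// Runs `operation`, waiting through the schedule after each failure.
// `sleep` is injectable so tests can skip real waiting.
async function withRetry<T>(
  operation: () => Promise<T>,
  delays: number[],
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => { setTimeout(resolve, ms); }),
): Promise<T> {
  for (const delayMs of delays) {
    try {
      return await operation();
    } catch {
      // A real client would retry only transient errors such as HTTP 503.
      await sleep(delayMs);
    }
  }
  return operation(); // final attempt; its error propagates to the caller
}

console.log(backoffDelays(4, 200, 1000)); // [200, 400, 800, 1000]
```

Production clients usually add random jitter to each delay so that many callers recovering from the same outage do not retry in lockstep.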
Correlating Frontend to Backend
// Frontend — enable cross-origin correlation
import { ApplicationInsights } from "@microsoft/applicationinsights-web";
const appInsights = new ApplicationInsights({
config: {
connectionString: "...",
enableCorsCorrelation: true,
correlationHeaderDomains: ["api.myapp.com"],
},
});
appInsights.loadAppInsights();
// Link slow page views to backend operations
pageViews
| where duration > 10000
| join kind=inner (requests) on operation_Id
| project user_Id, frontend_duration = duration, backend_name = name1, backend_duration = duration1
Resolution: Backend returned 200 in 2s but 4MB payload timed out on mobile. Fixed with pagination.
Best Practices
| Practice | Rationale |
|---|---|
| Set cloud_RoleName on every service | Required for Application Map |
| Use W3C Trace Context (SDK default) | Cross-vendor standard |
| Add business IDs as custom dimensions | Query by OrderId, CustomerId |
| Never sample exceptions | Always need full error context |
| Exclude health checks from telemetry | Reduces noise |
| Use OpenTelemetry for new projects | Vendor-neutral, future-proof |
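The health-check practice comes down to a predicate applied to each request item before it is sent. The sketch below shows one possible rule; in the .NET SDK this logic would typically live in a telemetry processor. The endpoint paths and the names `RequestItem`, `HEALTH_PATHS`, and `keepTelemetry` are assumptions for illustration.

```typescript
interface RequestItem {
  path: string;     // request path, e.g. "/health" or "/api/orders?page=2"
  success: boolean;
}

// Paths treated as health probes (assumed names; adjust to your endpoints).
const HEALTH_PATHS = ["/health", "/healthz", "/ready"];

// Drops successful health probes but keeps failed ones, since a failing
// probe is usually worth seeing alongside other telemetry.
function keepTelemetry(item: RequestItem): boolean {
  const pathOnly = (item.path.split("?")[0] ?? "").toLowerCase();
  const isProbe = HEALTH_PATHS.indexOf(pathOnly) !== -1;
  return !(isProbe && item.success);
}

console.log(keepTelemetry({ path: "/health", success: true }));     // false
console.log(keepTelemetry({ path: "/api/orders", success: true })); // true
```

Keeping failed probes is a deliberate choice in this sketch: it removes the steady stream of successful pings without hiding the moment a service starts failing its checks.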
Infrastructure Setup
# Create Application Insights with Log Analytics workspace
az monitor app-insights component create \
--app "my-app-insights" \
--location "eastus" \
--resource-group "rg-monitoring" \
--kind "web" \
--workspace "/subscriptions/{sub}/resourceGroups/rg-monitoring/providers/Microsoft.OperationalInsights/workspaces/my-workspace"
# Get connection string
az monitor app-insights component show \
--app "my-app-insights" \
--resource-group "rg-monitoring" \
--query "connectionString" -o tsv
# Set a daily ingestion cap of 5 GB for cost control
az monitor app-insights component billing update \
--app "my-app-insights" \
--resource-group "rg-monitoring" \
--cap 5
Common Pitfalls
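- Broken correlation in async — Fire-and-forget work (for example an un-awaited Task.Run) can outlive the request that started it, so its telemetry attaches to an operation that has already ended. Await the work, or start a dedicated operation for it.
- Missing Diagnostic-Id — Older Service Bus SDKs don't auto-propagate. Use v7+.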
- No cloud_RoleName — Application Map shows one blob instead of service graph.
- Logging PII in dimensions — Custom dimensions are stored in plain text.
- 100% sampling in production — Use adaptive sampling for high-traffic services.
Summary
| Capability | Implementation |
|---|---|
| Auto-correlation | App Insights SDK + W3C Trace Context |
| Cross-service tracing | traceparent (HTTP), Diagnostic-Id (Service Bus) |
| Business context | Custom dimensions via TelemetryInitializer |
| Query & analysis | KQL across requests, dependencies, traces, exceptions |
| Alerting | Scheduled queries + Smart Detection |
| Cost control | Adaptive sampling + daily caps |
Start with the SDK for automatic correlation, add business context through custom dimensions, query with KQL for insights, and alert on anomalies. Together these give end-to-end visibility across services, turning "something is slow" into "the inventory service SQL query is missing an index."