Distributed Tracing — End-to-End Correlation
Observability for Microservices Architecture
Introduction
In microservices architectures, a single user request can flow through dozens of services—a frontend API calls an orchestration function, which queues messages to Service Bus, triggers downstream functions, writes to databases, and calls external APIs. When something goes wrong, understanding where the failure occurred requires more than traditional logging. Distributed tracing provides visibility into the entire flow, making it essential for debugging, performance optimization, and understanding system behavior.
This guide covers:
- Distributed tracing fundamentals — Understanding traces and spans
- Azure implementation — Application Insights integration
- Correlation patterns — Connecting related operations
- Custom instrumentation — Adding business-specific telemetry
- Analysis and debugging — Using traces effectively
- Performance optimization — Identifying bottlenecks
Understanding Distributed Tracing
How Tracing Works
┌────────────────────────────────────────────────────────────────────────────┐
│                        DISTRIBUTED TRACING CONCEPT                         │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│  Single Request Flow:                                                      │
│  ────────────────────                                                      │
│                                                                            │
│  Client ──▶ API Gateway ──▶ Order Service ──▶ Service Bus ──▶ Inventory    │
│                  │                │                │              │        │
│                  └────────────────┴────────────────┴──────────────┘        │
│               Trace ID: abc-123 (shared across all services)               │
│                                                                            │
│  Visual Representation:                                                    │
│  ──────────────────────                                                    │
│  ┌─────────┬──────────────┬─────────────────┬──────────────┬────────────┐  │
│  │ Trace   │ Span 1       │ Span 2          │ Span 3       │ Span 4     │  │
│  │ abc-123 │[API Gateway] │ [Order Service] │ [Service Bus]│ [Inventory]│  │
│  │         │              │                 │              │            │  │
│  │         │◀──200ms──────│◀──150ms─────────│◀──50ms───────│◀──30ms     │  │
│  │         │              │                 │              │            │  │
│  └─────────┴──────────────┴─────────────────┴──────────────┴────────────┘  │
│                            Total Latency: 430ms                            │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
Key Concepts
┌─────────────────────────────────────────────────────────────────────┐
│                          TRACING CONCEPTS                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  TRACE                                                              │
│  ─────                                                              │
│  - Entire journey of a single request                               │
│  - Unique identifier (Trace ID)                                     │
│  - Contains multiple spans                                          │
│                                                                     │
│  SPAN                                                               │
│  ────                                                               │
│  - Single operation within a trace                                  │
│  - Has: name, start time, end time, status                          │
│  - Can have: events, tags, relationships                            │
│  - Parent-child relationships (Span hierarchy)                      │
│                                                                     │
│  SPAN ATTRIBUTES                                                    │
│  ───────────────                                                    │
│  ✓ span.name: "GET /api/orders"                                     │
│  ✓ span.kind: "server", "client", "producer", "consumer",           │
│               or "internal"                                         │
│  ✓ span.status: "ok" or "error"                                     │
│  ✓ span.tags: { "http.url", "db.statement", "user.id" }             │
│  ✓ span.events: { "exception", "retry", "cache_hit" }               │
│                                                                     │
│  CONTEXT PROPAGATION                                                │
│  ───────────────────                                                │
│  - Trace ID passed between services via:                            │
│      • HTTP headers (W3C Trace Context)                             │
│      • Service Bus message properties                               │
│      • Event Grid event attributes                                  │
│      • Custom correlation IDs                                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
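To make the W3C Trace Context bullet concrete, here is a minimal sketch of reading a `traceparent` value with the built-in `System.Diagnostics.ActivityContext` type. The header value is an illustrative example and `TraceContextReader` is a hypothetical helper name, not part of any library.

```csharp
using System;
using System.Diagnostics;

// Illustrative header value: version - trace-id (32 hex) - parent-id (16 hex) - trace-flags
const string traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

if (TraceContextReader.TryRead(traceparent, out var context))
{
    Console.WriteLine($"trace id : {context.TraceId}");
    Console.WriteLine($"parent id: {context.SpanId}");
    Console.WriteLine($"sampled  : {context.TraceFlags.HasFlag(ActivityTraceFlags.Recorded)}");
}

// Hypothetical helper: wraps ActivityContext.TryParse and tolerates a missing header.
static class TraceContextReader
{
    public static bool TryRead(string? traceparent, out ActivityContext context)
    {
        context = default;
        return !string.IsNullOrEmpty(traceparent)
            && ActivityContext.TryParse(traceparent, null, out context);
    }
}
```

Every service that receives this header continues the same trace ID and uses the parent ID as the parent of its own span, which is what lets the backend stitch the spans into one trace.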
Azure Application Insights Tracing
Setup and Configuration
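// Program.cs
// Classic Application Insights SDK (provides TelemetryClient for the custom events and metrics shown later)
builder.Services.AddApplicationInsightsTelemetry();

// OpenTelemetry tracing exported to Azure Monitor.
// AddOpenTelemetryTracing was removed from recent OpenTelemetry.Extensions.Hosting releases;
// AddOpenTelemetry().WithTracing(...) is its replacement.
builder.Services.AddOpenTelemetry()
    .ConfigureResource(resource => resource
        .AddService("MyFunctionApp", serviceVersion: "1.0.0")
        .AddAttributes(new Dictionary<string, object>
        {
            ["deployment.environment"] = "production"
        }))
    .WithTracing(tracing => tracing
        // Must match the name passed to new ActivitySource(...)
        .AddSource("MyFunctionApp")
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddAzureMonitorTraceExporter(o =>
        {
            o.ConnectionString = Environment.GetEnvironmentVariable("APPLICATIONINSIGHTS_CONNECTION_STRING");
        }));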
HTTP Request Tracing
public class TracedHttpFunction
{
    // One shared ActivitySource; the name must match AddSource("MyFunctionApp")
    private static readonly ActivitySource _activitySource = new("MyFunctionApp");

    private readonly ILogger<TracedHttpFunction> _logger;
    private readonly HttpClient _httpClient;
    // Assumption: the inventory endpoint comes from an app setting with this illustrative name
    private readonly string? _inventoryServiceUrl =
        Environment.GetEnvironmentVariable("INVENTORY_SERVICE_URL");

    public TracedHttpFunction(
        ILogger<TracedHttpFunction> logger,
        IHttpClientFactory httpClientFactory)
    {
        _logger = logger;
        _httpClient = httpClientFactory.CreateClient();
    }

    [Function("ProcessOrder")]
    public async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req)
    {
        // StartActivity returns null when nothing is listening, hence the ?. calls below
        using var activity = _activitySource.StartActivity(
            "ProcessOrder",
            ActivityKind.Server);
        activity?.SetTag("operation.name", "Order Processing");

        try
        {
            // The order is assumed to arrive as the JSON request body
            var order = await req.ReadFromJsonAsync<Order>();
            if (order is null)
                return new BadRequestResult();
            activity?.SetTag("order.id", order.OrderId);

            // Validate order
            using (_activitySource.StartActivity("ValidateOrder", ActivityKind.Internal))
            {
                await ValidateOrderAsync(order);
            }

            // Call downstream service
            using (var httpActivity = _activitySource.StartActivity(
                "CallInventoryService", ActivityKind.Client))
            {
                httpActivity?.SetTag("http.url", _inventoryServiceUrl);
                httpActivity?.SetTag("http.method", "POST");
                var response = await _httpClient.PostAsJsonAsync(
                    _inventoryServiceUrl, order);
                // Record the numeric status code rather than the enum name
                httpActivity?.SetTag("http.status_code", (int)response.StatusCode);
            }

            // Process result
            await SaveOrderAsync(order);
            return new OkResult();
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.AddEvent(new ActivityEvent("Exception",
                tags: new ActivityTagsCollection
                {
                    { "exception.type", ex.GetType().Name },
                    { "exception.message", ex.Message }
                }));
            throw;
        }
    }
}
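The function above calls `activity?.SetTag(...)` because `StartActivity` returns null when no listener is subscribed to the source. The following standalone sketch shows that behavior, and the automatic parent-child linking, using only `System.Diagnostics`. The source name `Demo.Orders` and the `ListenerFactory` helper are made up for illustration; in a real app the OpenTelemetry `AddSource` registration plays the listener's role.

```csharp
using System;
using System.Diagnostics;

var source = new ActivitySource("Demo.Orders"); // hypothetical source name

// No listener is registered yet, so no Activity object is created.
using (var before = source.StartActivity("BeforeListener"))
{
    Console.WriteLine($"before listener: {(before is null ? "null" : "created")}");
}

using var listener = ListenerFactory.ListenTo("Demo.Orders");

using (var parent = source.StartActivity("ProcessOrder", ActivityKind.Server))
using (var child = source.StartActivity("ValidateOrder"))
{
    // The child is parented to Activity.Current, which is the ProcessOrder span here.
    Console.WriteLine($"same trace     : {child?.TraceId == parent?.TraceId}");
    Console.WriteLine($"child of parent: {child?.ParentSpanId == parent?.SpanId}");
}

// Hypothetical helper: subscribes to one source and records every span.
static class ListenerFactory
{
    public static ActivityListener ListenTo(string sourceName)
    {
        var listener = new ActivityListener
        {
            ShouldListenTo = s => s.Name == sourceName,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded
        };
        ActivitySource.AddActivityListener(listener);
        return listener;
    }
}
```

Spans that never appear in Application Insights are often caused by a source name that was not registered, so this null result is worth checking first when debugging.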
Service Bus Tracing
public class TracedServiceBusFunction
{
    private static readonly ActivitySource _activitySource = new("MyFunctionApp");
    private readonly ILogger<TracedServiceBusFunction> _logger;

    public TracedServiceBusFunction(ILogger<TracedServiceBusFunction> logger)
    {
        _logger = logger;
    }

    [Function("ProcessQueue")]
    public async Task Run(
        [ServiceBusTrigger("orders-queue", Connection = "ServiceBusConnection")]
        ServiceBusReceivedMessage message,
        ServiceBusMessageActions messageActions)
    {
        // Get trace context from message
        var traceparent = message.ApplicationProperties.TryGetValue("traceparent", out var tp)
            ? tp?.ToString()
            : null;

        // A null parent ID falls back to Activity.Current or starts a new trace
        using var activity = _activitySource.StartActivity(
            "ProcessQueueMessage",
            ActivityKind.Consumer,
            parentId: traceparent);
        activity?.SetTag("message.id", message.MessageId);
        activity?.SetTag("queue.name", "orders-queue");

        try
        {
            var order = JsonSerializer.Deserialize<Order>(message.Body.ToString());
            activity?.SetTag("order.id", order?.OrderId);

            // Process order - each operation creates a child span
            await ProcessOrderInternalAsync(order);

            // Explicit completion is only needed when auto-complete is disabled on the trigger
            await messageActions.CompleteMessageAsync(message);
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}

// Producer: Add trace context to message
// (sender is a ServiceBusSender created elsewhere)
public async Task SendToQueueWithTracing(Order order)
{
    var message = new ServiceBusMessage(JsonSerializer.Serialize(order));

    // With the default W3C ID format, Activity.Id is already a traceparent value:
    // 00-<trace-id>-<span-id>-<flags>
    if (Activity.Current?.Id is { } traceparent)
    {
        message.ApplicationProperties["traceparent"] = traceparent;
    }

    await sender.SendMessageAsync(message);
}
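The producer and consumer halves can be checked together without a real queue. The sketch below is a simplified, in-process stand-in: a plain dictionary plays the role of the message's application properties, and `MessageTraceContext` and the `Demo.Messaging` source name are hypothetical. It verifies that the consumer span joins the producer's trace.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Listener so that StartActivity returns real spans in this standalone sketch.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo.Messaging",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded
};
ActivitySource.AddActivityListener(listener);

var source = new ActivitySource("Demo.Messaging"); // hypothetical source name
var properties = new Dictionary<string, object>(); // stands in for message application properties

// Producer side: stamp the current span's W3C ID onto the message.
using (var send = source.StartActivity("SendOrder", ActivityKind.Producer))
{
    MessageTraceContext.Inject(send, properties);
    Console.WriteLine($"sent with traceparent {properties["traceparent"]}");
}

// Consumer side (normally another process): continue the trace from the property.
using (var receive = MessageTraceContext.StartConsumer(source, "ProcessOrder", properties))
{
    Console.WriteLine($"consumer trace id     {receive?.TraceId}");
}

static class MessageTraceContext
{
    public const string TraceParentKey = "traceparent";

    public static void Inject(Activity? activity, IDictionary<string, object> properties)
    {
        // Skip the property entirely when there is no active span.
        if (activity?.Id is { } id)
            properties[TraceParentKey] = id;
    }

    public static Activity? StartConsumer(
        ActivitySource source, string name, IDictionary<string, object> properties)
    {
        ActivityContext parent = default;
        if (properties.TryGetValue(TraceParentKey, out var raw) && raw is string header)
            ActivityContext.TryParse(header, null, out parent);

        // A default parent context means "no remote parent": a new trace is started.
        return source.StartActivity(name, ActivityKind.Consumer, parent);
    }
}
```

Parsing into an `ActivityContext` rather than passing the raw string makes a malformed header degrade to a new trace instead of an odd parent ID.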
Custom Business Tracing
Add Business Context
public class BusinessTracingService
{
    // Register this name with AddSource("BusinessOperations") or its spans are not exported
    private static readonly ActivitySource _activitySource = new("BusinessOperations");

    public async Task<OrderProcessingResult> ProcessOrderWithTracing(Order order)
    {
        // No explicit parent is needed: StartActivity parents the span to Activity.Current
        using var activity = _activitySource.StartActivity(
            "OrderProcessing",
            ActivityKind.Internal);
        activity?.SetTag("order.id", order.OrderId);
        activity?.SetTag("order.customerId", order.CustomerId);
        activity?.SetTag("order.total", order.Total);
        activity?.SetTag("order.itemCount", order.Items.Count);

        // Track individual steps
        var steps = new List<string>();
        var currentStep = "none";
        try
        {
            // Step 1: Inventory check
            currentStep = "CheckInventory";
            using (_activitySource.StartActivity(currentStep))
            {
                var hasInventory = await CheckInventoryAsync(order);
                steps.Add($"inventory:{hasInventory}");
                if (!hasInventory)
                    throw new InsufficientInventoryException(order.OrderId);
            }

            // Step 2: Price validation
            currentStep = "ValidatePrice";
            using (_activitySource.StartActivity(currentStep))
            {
                await ValidatePricingAsync(order);
                steps.Add("price:valid");
            }

            // Step 3: Reserve items
            currentStep = "ReserveItems";
            using (_activitySource.StartActivity(currentStep))
            {
                await ReserveItemsAsync(order);
                steps.Add("items:reserved");
            }

            activity?.SetTag("processing.steps", string.Join(" → ", steps));
            return new OrderProcessingResult { Success = true };
        }
        catch (Exception ex)
        {
            // Record the step that was running, not the last one that completed
            activity?.SetTag("failure.step", currentStep);
            activity?.SetTag("failure.reason", ex.Message);
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}
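The repeated "start a span, run the step, record where it failed" pattern can be factored into a small helper. This is one possible sketch rather than the service's actual design: `TracedSteps` and the `Demo.Business` source name are hypothetical, and the steps are stubbed with lambdas.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Listener so that StartActivity returns real spans in this standalone sketch.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo.Business",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded
};
ActivitySource.AddActivityListener(listener);

var source = new ActivitySource("Demo.Business"); // hypothetical source name
using var parent = source.StartActivity("OrderProcessing");

try
{
    await TracedSteps.RunAsync(source, parent, "CheckInventory", () => Task.CompletedTask);
    await TracedSteps.RunAsync(source, parent, "ReserveItems",
        () => throw new InvalidOperationException("no stock"));
}
catch (InvalidOperationException)
{
    Console.WriteLine($"failure.step   = {parent?.GetTagItem("failure.step")}");
    Console.WriteLine($"failure.reason = {parent?.GetTagItem("failure.reason")}");
}

static class TracedSteps
{
    // Runs one step in its own child span and, on failure, tags the parent with the step name.
    public static async Task RunAsync(
        ActivitySource source, Activity? parent, string stepName, Func<Task> step)
    {
        using var span = source.StartActivity(stepName);
        try
        {
            await step();
            span?.SetStatus(ActivityStatusCode.Ok);
        }
        catch (Exception ex)
        {
            span?.SetStatus(ActivityStatusCode.Error, ex.Message);
            parent?.SetTag("failure.step", stepName);
            parent?.SetTag("failure.reason", ex.Message);
            parent?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}
```

Tagging the parent span keeps the failing step visible on the top-level operation, so a query over `OrderProcessing` spans can group failures by step without joining child spans.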
Custom Events and Metrics
public class CustomTelemetry
{
    private static readonly ActivitySource _activitySource = new("MyFunctionApp");
    private readonly TelemetryClient _telemetryClient;

    public CustomTelemetry(TelemetryClient telemetryClient)
    {
        _telemetryClient = telemetryClient;
    }

    // Custom events
    public void TrackOrderEvent(string orderId, string eventType, Dictionary<string, string> properties)
    {
        var propertiesWithContext = new Dictionary<string, string>
        {
            { "orderId", orderId },
            { "eventType", eventType },
            { "timestamp", DateTime.UtcNow.ToString("O") },
            { "correlationId", Activity.Current?.TraceId.ToString() ?? "" }
        };
        // Caller-supplied properties override the defaults on key collisions
        foreach (var prop in properties)
            propertiesWithContext[prop.Key] = prop.Value;
        _telemetryClient.TrackEvent($"Order{eventType}", propertiesWithContext);
    }

    // Custom metrics
    public void TrackProcessingTime(string operation, TimeSpan duration, bool success)
    {
        // One metric with two dimensions; the operation name is a dimension value,
        // so it does not need to be part of the metric ID as well
        _telemetryClient.GetMetric(
            "Integration.Duration",
            "Operation",
            "Outcome").TrackValue(duration.TotalMilliseconds, operation, success ? "Success" : "Failure");
    }

    // Dependency tracking
    public async Task<T> TrackDependency<T>(string dependencyType, string target, Func<Task<T>> operation)
    {
        // The span records its own start time and duration, so no manual timing is needed
        using var span = _activitySource.StartActivity(dependencyType, ActivityKind.Client);
        // Custom tag names; swap in db.* or http.* tags when the dependency type is known
        span?.SetTag("dependency.type", dependencyType);
        span?.SetTag("peer.service", target);
        try
        {
            var result = await operation();
            span?.SetStatus(ActivityStatusCode.Ok);
            return result;
        }
        catch (Exception ex)
        {
            span?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}
Analysis and Debugging
Application Insights Queries
// Trace for specific operation
traces
| where timestamp > ago(1h)
| where message contains "ProcessOrder"
| project timestamp, message, traceId = operation_Id, customDimensions

// Trace with dependencies
// After the join, columns from the right-hand table that clash get a "1" suffix
requests
| where name == "ProcessOrder"
| join kind=inner (dependencies) on operation_Id
| project timestamp, traceId = operation_Id, requestName = name,
    dependencyName = name1, target, dependencyDuration = duration1, dependencySuccess = success1

// Performance analysis
requests
| where timestamp > ago(24h)
| summarize
    avgDuration = avg(duration),
    p50 = percentile(duration, 50),
    p95 = percentile(duration, 95),
    p99 = percentile(duration, 99),
    totalCount = count()
    by name
| order by avgDuration desc

// Error analysis
// customDimensions is dynamic, so cast it before combining it with the string column
requests
| where success == false
| summarize errorCount = count() by issueType = coalesce(tostring(customDimensions.exceptionType), name)
| order by errorCount desc

// End-to-end transaction search
dependencies
| where timestamp > ago(1h)
| where target contains "servicebus"
| project timestamp, name, target, type, data, duration, operation_Id
| join kind=inner (requests) on operation_Id
| project requestName = name1, dependencyName = name, dependencyTarget = target, dependencyDuration = duration
Visualizing Traces
Transaction Search
┌─────────────────────────────────────────────────────────────────────┐
│                     TRANSACTION SEARCH RESULTS                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Trace Timeline (Gantt Chart):                                      │
│  ─────────────────────────────                                      │
│                                                                     │
│  [350ms] API Gateway                                                │
│  ┌────────────┬───────────────────────────────────────────────────┐ │
│  │ 50ms       │ 300ms Order Service                               │ │
│  │            │ ┌──────────┬─────────────┬───────────┬──────────┐ │ │
│  │            │ │ Validate │ Service Bus │ Inventory │ Database │ │ │
│  │            │ │ (30ms)   │ (100ms)     │ (50ms)    │ (20ms)   │ │ │
│  │            │ └──────────┴─────────────┴───────────┴──────────┘ │ │
│  └────────────┴───────────────────────────────────────────────────┘ │
│                                                                     │
│  Properties:                                                        │
│  ✓ Trace ID: abc-123-def-456                                        │
│  ✓ Duration: 350ms                                                  │
│  ✓ Success: true                                                    │
│  ✓ User: john.doe@company.com                                       │
│  ✓ Order ID: ORD-12345                                              │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Best Practices
Implementation Checklist
| Practice | Description |
|---|---|
| Use ActivitySource | Modern .NET tracing API |
| Propagate context | Pass trace IDs across all boundaries |
| Add business tags | Include order IDs, user IDs in traces |
| Track failures | Record error details in spans |
| Sample appropriately | Balance volume vs. visibility |
| Correlate logs | Link logs to trace IDs |
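
For the "Correlate logs" row, the core idea is that every log line carries the current trace and span IDs, so a log entry can be matched to its trace (Application Insights exposes the trace ID as `operation_Id`, as used in the queries above). A minimal sketch using only `System.Diagnostics` follows; `TraceLog` is a hypothetical helper, and a real application would normally rely on its logging framework's scope or enrichment features instead of formatting strings by hand.

```csharp
using System;
using System.Diagnostics;

Console.WriteLine(TraceLog.Format("outside any span"));

// Starting an Activity directly is enough to set Activity.Current for this demo.
var activity = new Activity("ProcessOrder").Start();
Console.WriteLine(TraceLog.Format("order received"));
activity.Stop();

static class TraceLog
{
    // Prefixes a message with the IDs of the span that is current on this async flow.
    public static string Format(string message)
    {
        var current = Activity.Current;
        return current is null
            ? $"[no-trace] {message}"
            : $"[trace={current.TraceId} span={current.SpanId}] {message}";
    }
}
```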
Sampling Strategy
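// Configure adaptive sampling
services.Configure<TelemetryConfiguration>(config =>
{
    var builder = config.DefaultTelemetrySink.TelemetryProcessorChainBuilder;

    // Target about 100 telemetry items per second per instance.
    // Exceptions and custom events are excluded, so they are never sampled out.
    builder.UseAdaptiveSampling(
        maxTelemetryItemsPerSecond: 100,
        excludedTypes: "Exception;Event");
    builder.Build();
});
// Notes:
// - In ASP.NET Core, disable the SDK's default adaptive sampling
//   (EnableAdaptiveSampling = false) before adding a custom configuration like this one.
// - Critical operations cannot be excluded by telemetry type. One option to verify against
//   your SDK version is a telemetry initializer that sets SamplingPercentage = 100 on them.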
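Whatever sampler is used, every service must reach the same keep-or-drop decision for a given trace, otherwise traces arrive with missing spans. A common way to get that property is to derive the decision from the trace ID itself. The sketch below illustrates the idea with a hypothetical `RatioSampler`; it is not how Application Insights adaptive sampling is implemented, and the 10% ratio is arbitrary.

```csharp
using System;
using System.Diagnostics;

var traceId = ActivityTraceId.CreateRandom();
Console.WriteLine($"{traceId}: keep at 10% = {RatioSampler.ShouldSample(traceId, 0.10)}");

// Wiring the decision into a listener: spans of unsampled traces are never created.
using var listener = new ActivityListener
{
    ShouldListenTo = _ => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        RatioSampler.ShouldSample(options.TraceId, 0.10)
            ? ActivitySamplingResult.AllDataAndRecorded
            : ActivitySamplingResult.None
};
ActivitySource.AddActivityListener(listener);

static class RatioSampler
{
    // Deterministic: the same trace ID always yields the same answer for the same ratio.
    public static bool ShouldSample(ActivityTraceId traceId, double ratio)
    {
        if (ratio >= 1.0) return true;
        if (ratio <= 0.0) return false;

        // Interpret the first 8 hex characters (32 bits) of the trace ID as a value in [0, 1].
        uint prefix = Convert.ToUInt32(traceId.ToHexString().Substring(0, 8), 16);
        return prefix / (double)uint.MaxValue < ratio;
    }
}
```

Because the decision depends only on the trace ID and the ratio, services that share the ratio keep or drop a trace as a whole.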
Related Topics
- SLI / SLO / SLA Definition — Service level objectives
- Azure Monitor Enterprise Alerting — Alert configuration
- Log Analytics Architecture — Log management
Azure Integration Hub - Architect Level Observability & Operations at Scale