Distributed Tracing — End-to-End Correlation

Observability for Microservices Architecture


Introduction

In microservices architectures, a single user request can flow through dozens of services—a frontend API calls an orchestration function, which queues messages to Service Bus, triggers downstream functions, writes to databases, and calls external APIs. When something goes wrong, understanding where the failure occurred requires more than traditional logging. Distributed tracing provides visibility into the entire flow, making it essential for debugging, performance optimization, and understanding system behavior.

This comprehensive guide covers:

  • Distributed tracing fundamentals — Understanding traces and spans
  • Azure implementation — Application Insights integration
  • Correlation patterns — Connecting related operations
  • Custom instrumentation — Adding business-specific telemetry
  • Analysis and debugging — Using traces effectively
  • Performance optimization — Identifying bottlenecks

Understanding Distributed Tracing

How Tracing Works

┌────────────────────────────────────────────────────────────────────────────┐
│                 DISTRIBUTED TRACING CONCEPT                                │
├────────────────────────────────────────────────────────────────────────────┤
│                                                                            │
│   Single Request Flow:                                                     │
│   ─────────────────────                                                    │
│                                                                            │
│   Client ──▶ API Gateway ──▶ Order Service ──▶ Service Bus ──▶ Inventory   │
│    │                    │                    │               │             │
│    │                    │                    │               │             │
│    └────────────────────┴────────────────────┴───────────────┘             │
│    Trace ID: abc-123 (shared across all services)                          │
│                                                                            │
│   Visual Representation:                                                   │
│   ────────────────────                                                     │
│   ┌────────┬──────────────┬─────────────────┬──────────────┬────────────┐  │
│   │ Trace  │ Span 1       │ Span 2          │ Span 3       │ Span 4     │  │
│   │ abc-123│ [API Gateway]│ [Order Service] │ [Service Bus]│ [Inventory]│  │
│   │        │              │                 │              │            │  │
│   │        │◀──200ms──────│◀──150ms─────────│◀──50ms───────│◀──30ms     │  │
│   │        │              │                 │              │            │  │
│   └────────┴──────────────┴─────────────────┴──────────────┴────────────┘  │
│   Total Latency: 430ms                                                     │
│                                                                            │
└────────────────────────────────────────────────────────────────────────────┘
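
The shared trace ID in the diagram can be observed with the .NET System.Diagnostics API alone. The following is a minimal sketch, not a production setup: the source name "TracingDemo" is an assumption, and the three nested activities stand in for services that would normally run in separate processes.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Use W3C IDs (already the default on modern .NET; set here for clarity).
Activity.DefaultIdFormat = ActivityIdFormat.W3C;

// Without a listener, StartActivity returns null, so register one that records everything.
var finished = new List<Activity>();
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "TracingDemo",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData,
    ActivityStopped = a => finished.Add(a)
};
ActivitySource.AddActivityListener(listener);

using var demoSource = new ActivitySource("TracingDemo");

// Each nested StartActivity call becomes a child of the activity started before it.
using (var gateway = demoSource.StartActivity("API Gateway", ActivityKind.Server))
using (var orders = demoSource.StartActivity("Order Service", ActivityKind.Internal))
using (var inventory = demoSource.StartActivity("Inventory", ActivityKind.Internal))
{
    // Work for the innermost span would happen here.
}

// Activities stop innermost first, so the list order is Inventory, Order Service, API Gateway.
foreach (var a in finished)
{
    Console.WriteLine($"{a.DisplayName,-14} trace={a.TraceId} span={a.SpanId} parent={a.ParentSpanId}");
}
```

Every printed line carries the same trace value, while each span's parent value equals the span ID of the activity that enclosed it.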

Key Concepts

┌─────────────────────────────────────────────────────────────────────┐
│                     TRACING CONCEPTS                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   TRACE                                                             │
│   ─────                                                             │
│   - Entire journey of a single request                              │
│   - Unique identifier (Trace ID)                                    │
│   - Contains multiple spans                                         │
│                                                                     │
│   SPAN                                                              │
│   ─────                                                             │
│   - Single operation within a trace                                 │
│   - Has: name, start time, end time, status                         │
│   - Can have: events, tags, relationships                           │
│   - Parent-child relationships (Span hierarchy)                     │
│                                                                     │
│   SPAN ATTRIBUTES                                                   │
│   ───────────────                                                   │
│   ✓ span.name: "GET /api/orders"                                    │
│   ✓ span.kind: "client" or "server"                                 │
│   ✓ span.status: "ok" or "error"                                    │
│   ✓ span.tags: { "http.url", "db.statement", "user.id" }            │
│   ✓ span.events: { "exception", "retry", "cache_hit" }              │
│                                                                     │
│   CONTEXT PROPAGATION                                               │
│   ─────────────────────                                             │
│   - Trace ID passed between services via:                           │
│     • HTTP headers (W3C Trace Context)                              │
│     • Service Bus message properties                                │
│     • Event Grid event attributes                                   │
│     • Custom correlation IDs                                        │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
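
To make the W3C Trace Context header concrete, here is a small sketch that parses a traceparent value with ActivityContext.TryParse. The header value is the example from the W3C specification, not output from a real system.

```csharp
using System;
using System.Diagnostics;

// Example header value taken from the W3C Trace Context specification.
// Layout: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" trace-flags (2 hex)
const string traceparent = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";

// The second argument is the optional tracestate header; it is not used here.
bool parsed = ActivityContext.TryParse(traceparent, null, out ActivityContext context);

Console.WriteLine($"parsed:   {parsed}");
Console.WriteLine($"trace id: {context.TraceId}");
Console.WriteLine($"span id:  {context.SpanId}");
Console.WriteLine($"sampled:  {context.TraceFlags.HasFlag(ActivityTraceFlags.Recorded)}");
```

The parsed ActivityContext can then be passed as the parent when starting an activity in the receiving service.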

Azure Application Insights Tracing

Setup and Configuration

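// Add Application Insights
// Program.cs
// Assumes the OpenTelemetry.Extensions.Hosting, OpenTelemetry.Instrumentation.AspNetCore,
// OpenTelemetry.Instrumentation.Http, and Azure.Monitor.OpenTelemetry.Exporter packages.
using Azure.Monitor.OpenTelemetry.Exporter;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

builder.Services.AddApplicationInsightsTelemetry();
// AddOpenTelemetryTracing was removed from OpenTelemetry.Extensions.Hosting;
// current versions use AddOpenTelemetry().WithTracing(...).
builder.Services.AddOpenTelemetry().WithTracing(tracing =>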
{
    tracing.AddSource("MyFunctionApp")
        .SetResourceBuilder(ResourceBuilder.CreateDefault()
            .AddService("MyFunctionApp")
            .AddAttributes(new Dictionary<string, object>
            {
                ["service.version"] = "1.0.0",
                ["deployment.environment"] = "production"
            }))
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddAzureMonitorTraceExporter(o =>
        {
            o.ConnectionString = Environment.GetEnvironmentVariable("APPLICATIONINSIGHTS_CONNECTION_STRING");
        });
});

HTTP Request Tracing

public class TracedHttpFunction
{
    // Create the ActivitySource once and reuse it. The name must match the
    // AddSource("MyFunctionApp") registration, or StartActivity returns null.
    private static readonly ActivitySource _activitySource = new("MyFunctionApp");

    private readonly ILogger<TracedHttpFunction> _logger;
    private readonly HttpClient _httpClient;

    public TracedHttpFunction(
        ILogger<TracedHttpFunction> logger,
        IHttpClientFactory httpClientFactory)
    {
        _logger = logger;
        _httpClient = httpClientFactory.CreateClient();
    }

    [Function("ProcessOrder")]
    public async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req)
    {
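        // Read the order from the request body (ReadFromJsonAsync comes from Microsoft.AspNetCore.Http).
        var order = await req.ReadFromJsonAsync<Order>();
        if (order is null)
            return new BadRequestResult();

        // Hypothetical setting name; supply the inventory endpoint however your app is configured.
        var inventoryServiceUrl = Environment.GetEnvironmentVariable("INVENTORY_SERVICE_URL");

        using var activity = _activitySource.StartActivity(
            "ProcessOrder",
            ActivityKind.Server);

        activity?.SetTag("order.id", order.OrderId);
        activity?.SetTag("operation.name", "Order Processing");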

        try
        {
            // Validate order
            using (var validateActivity = _activitySource.StartActivity(
                "ValidateOrder", ActivityKind.Internal))
            {
                await ValidateOrderAsync(order);
            }

            // Call downstream service
            using (var httpActivity = _activitySource.StartActivity(
                "CallInventoryService", ActivityKind.Client))
            {
                httpActivity?.SetTag("http.url", inventoryServiceUrl);
                httpActivity?.SetTag("http.method", "POST");

                var response = await _httpClient.PostAsJsonAsync(
                    inventoryServiceUrl, order);

                // Record the numeric status code (for example 200), not the enum name
                httpActivity?.SetTag("http.status_code", (int)response.StatusCode);
            }

            // Process result
            await SaveOrderAsync(order);

            return new OkResult();
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            // OpenTelemetry semantic conventions name this event "exception" (lowercase)
            activity?.AddEvent(new ActivityEvent("exception",
                tags: new ActivityTagsCollection
                {
                    { "exception.type", ex.GetType().Name },
                    { "exception.message", ex.Message }
                }));
            throw;
        }
    }
}

Service Bus Tracing

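public class TracedServiceBusFunction
{
    // Must match a source name registered with AddSource(...) in Program.cs
    private static readonly ActivitySource _activitySource = new("MyFunctionApp");

    [Function("ProcessQueue")]
    public async Task Run(
        [ServiceBusTrigger("orders-queue", Connection = "ServiceBusConnection",
            AutoCompleteMessages = false)]
        ServiceBusReceivedMessage message,
        ServiceBusMessageActions messageActions)
    {
        // Get trace context from message
        var traceparent = message.ApplicationProperties.TryGetValue("traceparent", out var tp)
            ? tp?.ToString()
            : null;

        // Parse the W3C traceparent so this span joins the producer's trace.
        // If the property is missing or malformed, parentContext stays default
        // and the span falls back to Activity.Current or starts a new trace.
        ActivityContext.TryParse(traceparent, null, out var parentContext);

        using var activity = _activitySource.StartActivity(
            "ProcessQueueMessage",
            ActivityKind.Consumer,
            parentContext);

        activity?.SetTag("message.id", message.MessageId);
        activity?.SetTag("queue.name", "orders-queue");

        try
        {
            var order = JsonSerializer.Deserialize<Order>(message.Body.ToString());

            activity?.SetTag("order.id", order?.OrderId);

            // Process order - each operation creates a child span
            await ProcessOrderInternalAsync(order);

            // Settle explicitly through the injected ServiceBusMessageActions;
            // ServiceBusReceivedMessage itself has no CompleteAsync method.
            await messageActions.CompleteMessageAsync(message);
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}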

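// Producer: Add trace context to message
// Assumes `sender` is a ServiceBusSender field on the enclosing class.
public async Task SendToQueueWithTracing(Order order)
{
    var message = new ServiceBusMessage(JsonSerializer.Serialize(order));

    // With the W3C ID format, Activity.Id is already a traceparent string
    // ("00-<trace-id>-<span-id>-<flags>"). Skip the property when no activity is active.
    var traceparent = Activity.Current?.Id;
    if (traceparent is not null)
    {
        message.ApplicationProperties["traceparent"] = traceparent;
    }

    await sender.SendMessageAsync(message);
}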
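
The producer and consumer halves above can be exercised together without a Service Bus namespace. In the sketch below a plain dictionary stands in for ServiceBusMessage.ApplicationProperties, and the source name "QueueDemo" is an assumption. It only demonstrates that the consumer span ends up in the producer's trace.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

Activity.DefaultIdFormat = ActivityIdFormat.W3C;

// A listener is required, otherwise StartActivity returns null.
using var queueListener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "QueueDemo",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData
};
ActivitySource.AddActivityListener(queueListener);

using var queueSource = new ActivitySource("QueueDemo");

// Stand-in for ServiceBusMessage.ApplicationProperties.
var properties = new Dictionary<string, object>();

// Producer side: write the current activity's W3C ID into the message properties.
string producerTraceId;
using (var send = queueSource.StartActivity("SendOrder", ActivityKind.Producer))
{
    producerTraceId = send?.TraceId.ToString() ?? "";
    if (send?.Id is string id)
    {
        properties["traceparent"] = id;
    }
}

// Consumer side (normally another process): restore the parent context from the property.
properties.TryGetValue("traceparent", out var raw);
ActivityContext.TryParse(raw as string, null, out var parentContext);

string consumerTraceId;
string consumerParentSpanId;
using (var receive = queueSource.StartActivity("ProcessOrder", ActivityKind.Consumer, parentContext))
{
    consumerTraceId = receive?.TraceId.ToString() ?? "";
    consumerParentSpanId = receive?.ParentSpanId.ToString() ?? "";
}

Console.WriteLine($"producer trace: {producerTraceId}");
Console.WriteLine($"consumer trace: {consumerTraceId}");
Console.WriteLine($"same trace:     {producerTraceId == consumerTraceId}");
```

The consumer activity is started after the producer activity has ended, so the shared trace ID comes only from the propagated property, not from Activity.Current.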

Custom Business Tracing

Add Business Context

public class BusinessTracingService
{
    private readonly ActivitySource _activitySource;

    public BusinessTracingService()
    {
        // "BusinessOperations" must also be registered with AddSource(...) at startup,
        // otherwise no listener is attached and StartActivity returns null.
        _activitySource = new ActivitySource("BusinessOperations");
    }

    public async Task<OrderProcessingResult> ProcessOrderWithTracing(Order order)
    {
        // StartActivity parents the new span to Activity.Current automatically,
        // so no explicit ActivityContext needs to be constructed here.
        using var activity = _activitySource.StartActivity(
            "OrderProcessing",
            ActivityKind.Server);

        activity?.SetTag("order.id", order.OrderId);
        activity?.SetTag("order.customerId", order.CustomerId);
        activity?.SetTag("order.total", order.Total);
        activity?.SetTag("order.itemCount", order.Items.Count);

        // Track individual steps
        var steps = new List<string>();
        
        try
        {
            // Step 1: Inventory check
            using (var span = _activitySource.StartActivity("CheckInventory"))
            {
                var hasInventory = await CheckInventoryAsync(order);
                steps.Add($"inventory:{hasInventory}");
                
                if (!hasInventory)
                    throw new InsufficientInventoryException(order.OrderId);
            }

            // Step 2: Price validation
            using (var span = _activitySource.StartActivity("ValidatePrice"))
            {
                await ValidatePricingAsync(order);
                steps.Add("price:valid");
            }

            // Step 3: Reserve items
            using (var span = _activitySource.StartActivity("ReserveItems"))
            {
                await ReserveItemsAsync(order);
                steps.Add("items:reserved");
            }

            activity?.SetTag("processing.steps", string.Join(" → ", steps));
            
            return new OrderProcessingResult { Success = true };
        }
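        catch (Exception ex)
        {
            // steps only contains entries recorded before the exception, so this is the
            // last recorded step, which is usually the one before the step that failed.
            activity?.SetTag("failure.lastRecordedStep", steps.LastOrDefault() ?? "none");
            activity?.SetTag("failure.reason", ex.Message);
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }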
    }
}

Custom Events and Metrics

public class CustomTelemetry
{
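    private static readonly ActivitySource _activitySource = new("MyFunctionApp");
    private readonly TelemetryClient _telemetryClient;

    // TelemetryClient is registered by AddApplicationInsightsTelemetry() and injected here
    public CustomTelemetry(TelemetryClient telemetryClient)
    {
        _telemetryClient = telemetryClient;
    }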

    // Custom events
    public void TrackOrderEvent(string orderId, string eventType, Dictionary<string, string> properties)
    {
        var propertiesWithContext = new Dictionary<string, string>
        {
            { "orderId", orderId },
            { "eventType", eventType },
            { "timestamp", DateTime.UtcNow.ToString("O") },
            { "correlationId", Activity.Current?.TraceId.ToString() ?? "" }
        };
        
        foreach (var prop in properties)
            propertiesWithContext[prop.Key] = prop.Value;

        _telemetryClient.TrackEvent($"Order{eventType}", propertiesWithContext);
    }

    // Custom metrics
    public void TrackProcessingTime(string operation, TimeSpan duration, bool success)
    {
        // GetMetric returns a pre-aggregating Metric; values are recorded with TrackValue,
        // passing one value per declared dimension ("Operation", "Outcome").
        _telemetryClient.GetMetric(
            $"Integration.{operation}.Duration",
            "Operation",
            "Outcome").TrackValue(duration.TotalMilliseconds, operation, success ? "Success" : "Failure");
    }

    // Dependency tracking
    public async Task<T> TrackDependency<T>(string dependencyType, string target, Func<Task<T>> operation)
    {
        // The Activity records its own start time and duration, so no manual timing is needed.
        using var activity = _activitySource.StartActivity(dependencyType, ActivityKind.Client);

        // Generic tags; reserve database tags such as db.system and db.statement
        // for calls that really are database queries.
        activity?.SetTag("dependency.type", dependencyType);
        activity?.SetTag("dependency.target", target);

        try
        {
            var result = await operation();

            activity?.SetStatus(ActivityStatusCode.Ok);

            return result;
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}

Analysis and Debugging

Application Insights Queries

// Trace for specific operation
traces
| where message contains "ProcessOrder"
| where timestamp > ago(1h)
| extend traceId = operation_Id
| project timestamp, message, traceId, customDimensions

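// Trace with dependencies
// Both tables share operation_Id; after the join, same-named columns from the
// right-hand table (dependencies) get a "1" suffix.
requests
| where name == "ProcessOrder"
| join kind=inner (dependencies) on operation_Id
| project timestamp, requestName = name, dependencyName = name1, target, dependencyDuration = duration1, dependencySuccess = success1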

// Performance analysis
requests
| where timestamp > ago(24h)
| summarize 
    avgDuration = avg(duration),
    p50 = percentile(duration, 50),
    p95 = percentile(duration, 95),
    p99 = percentile(duration, 99),
    totalCount = count()
  by name
| order by avgDuration desc

// Error analysis
requests
| where success == false
| summarize errorCount = count() by issueType = coalesce(tostring(customDimensions.exceptionType), name)
| order by errorCount desc

// End-to-end transaction search
// dependencies is the left table here, so its columns keep their names and
// same-named columns from requests get a "1" suffix.
dependencies
| where timestamp > ago(1h)
| where target contains "servicebus"
| project timestamp, name, target, type, data, duration, operation_Id
| join kind=inner (requests) on operation_Id
| project requestName = name1, dependencyName = name, dependencyTarget = target, dependencyDuration = duration

Visualizing Traces

Transaction Search

┌─────────────────────────────────────────────────────────────────────┐
│                  TRANSACTION SEARCH RESULTS                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Trace Timeline (Gantt Chart):                                     │
│   ────────────────────────────                                      │
│                                                                     │
│   [350ms] API Gateway                                               │
│   ├────────────┬──────────────────────────────────────────────────┐ │
│   │ 50ms       │ 300ms Order Service                              │ │
│   │            │ ├──────────┬──────────┬──────────┬────────────┐  │ │
│   │            │ │ Validate │ServiceBus│ Inventory│  Database  │  │ │
│   │            │ │  (30ms)  │ (100ms)  │  (50ms)  │   (20ms)   │  │ │
│   │            │ └──────────┴──────────┴──────────┴────────────┘  │ │
│   └────────────┴──────────────────────────────────────────────────┘ │
│                                                                     │
│   Properties:                                                       │
│   ✓ Trace ID: abc-123-def-456                                       │
│   ✓ Duration: 350ms                                                 │
│   ✓ Success: true                                                   │
│   ✓ User: john.doe@company.com                                      │
│   ✓ Order ID: ORD-12345                                             │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Best Practices

Implementation Checklist

Practice               Description
─────────────────────  ─────────────────────────────────────
Use ActivitySource     Modern .NET tracing API
Propagate context      Pass trace IDs across all boundaries
Add business tags      Include order IDs, user IDs in traces
Track failures         Record error details in spans
Sample appropriately   Balance volume vs. visibility
Correlate logs         Link logs to trace IDs
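
As an illustration of the "Correlate logs" practice, the sketch below prefixes log lines with the current trace and span IDs. WithTraceContext is a hypothetical helper, and a manually started Activity stands in for a span created by instrumentation. In a real application the logging framework or the Application Insights SDK would normally attach these IDs for you.

```csharp
using System;
using System.Diagnostics;

Activity.DefaultIdFormat = ActivityIdFormat.W3C;

// Hypothetical helper: formats a log line with the IDs of the active span, if any.
string WithTraceContext(string message)
{
    var current = Activity.Current;
    return current is null
        ? $"[no-trace] {message}"
        : $"[trace={current.TraceId} span={current.SpanId}] {message}";
}

// No activity is running yet, so the line carries no trace context.
string outside = WithTraceContext("no active span");

// Starting an Activity sets Activity.Current for this execution context.
var logActivity = new Activity("ProcessOrder").Start();
string inside = WithTraceContext("validating order");
string activeTraceId = logActivity.TraceId.ToString();
logActivity.Stop();

Console.WriteLine(outside);
Console.WriteLine(inside);
```

Searching logs for the trace value then returns every line written while handling that request.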

Sampling Strategy

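// Configure adaptive sampling (classic Application Insights SDK).
// Disable the default adaptive sampling first (EnableAdaptiveSampling = false in
// ApplicationInsightsServiceOptions) so that this configuration is the one that applies.
services.Configure<TelemetryConfiguration>(config =>
{
    var chain = config.DefaultTelemetrySink.TelemetryProcessorChainBuilder;

    // Target roughly 100 telemetry items per second.
    // Don't sample:
    // - Errors (Exception)
    // - Custom events (Event)
    // Critical operations cannot be selected by telemetry type; keeping them
    // requires a custom telemetry processor or initializer.
    chain.UseAdaptiveSampling(
        maxTelemetryItemsPerSecond: 100,
        excludedTypes: "Exception;Event");

    // Build() applies the configured processor chain
    chain.Build();
});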
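
For spans created through ActivitySource, a sampling decision can also be made when the span starts. The following is a sketch of fixed-ratio sampling keyed on the trace ID, a different technique from the adaptive sampling configured above. ShouldSample, the 10% ratio, and the "SamplingDemo" source name are assumptions, and the listener relies on ActivityCreationOptions.TraceId, which is available from .NET 6.

```csharp
using System;
using System.Diagnostics;

// Decide from the trace ID alone, so every service reaches the same verdict for a trace.
bool ShouldSample(ActivityTraceId traceId, double ratio)
{
    if (ratio >= 1.0) return true;
    if (ratio <= 0.0) return false;

    // Interpret the first 16 hex characters of the trace ID as a number in [0, 2^64).
    ulong value = Convert.ToUInt64(traceId.ToHexString().Substring(0, 16), 16);
    return value < ratio * ulong.MaxValue;
}

const double ratio = 0.10; // hypothetical target: keep about 10% of traces

using var samplingListener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "SamplingDemo",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ShouldSample(options.TraceId, ratio)
            ? ActivitySamplingResult.AllDataAndRecorded
            : ActivitySamplingResult.None
};
ActivitySource.AddActivityListener(samplingListener);

using var samplingSource = new ActivitySource("SamplingDemo");

// Start many root activities; StartActivity returns null for traces that were not sampled.
int kept = 0;
const int total = 10_000;
for (int i = 0; i < total; i++)
{
    using var activity = samplingSource.StartActivity("Request");
    if (activity is not null) kept++;
}

Console.WriteLine($"kept {kept} of {total} traces");
```

Because trace IDs are random, the kept count varies from run to run and should land near the configured ratio. A ratio-based sampler like this drops errors along with everything else, which is one reason the configuration above excludes exceptions from sampling.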
