01 Monitoring & Observability Overview
Modern cloud systems are too complex to debug with guesswork. Observability is the ability to understand a system's internal state from its external outputs, conventionally grouped into three pillars: logs, metrics, and traces. Azure's monitoring stack maps these pillars to specific services that work together:
The Three Pillars of Observability
| Pillar | What It Answers | Azure Service | Data Type |
|---|---|---|---|
| Logs | What happened? What were the inputs/outputs? | Log Analytics / App Insights | Structured JSON events |
| Metrics | How is the system performing right now? | Azure Monitor Metrics | Numeric time-series |
| Traces | Why did this request take so long? What path did it take? | App Insights / Jaeger / Tempo | Distributed spans |
How the Stack Fits Together
Your Application (.NET 8 / Functions / AKS pods)
        |
        |  emits via OpenTelemetry SDK or App Insights SDK
        v
+---------------------------------------------------------+
|              Azure Monitor (the platform)               |
|                                                         |
|  +----------------------+  +-------------------------+  |
|  | Application Insights |  | Log Analytics           |  |
|  | (APM + traces +      |  | Workspace               |  |
|  |  exceptions +        |  | (logs from ALL          |  |
|  |  custom events)      |  |  Azure resources)       |  |
|  +----------------------+  +-------------------------+  |
|                                                         |
|  +----------------------+  +-------------------------+  |
|  | Metrics Store        |  | Alerts Engine           |  |
|  | (time-series,        |  | (rules, action          |  |
|  |  auto-collected      |  |  groups, PagerDuty)     |  |
|  |  from all services)  |  |                         |  |
|  +----------------------+  +-------------------------+  |
+---------------------------------------------------------+
        |
        v
Workbooks · Dashboards · Grafana · Power BI
02 Application Insights
Application Insights is Azure's Application Performance Management (APM) service. It automatically collects request telemetry, dependency calls (SQL, HTTP, Service Bus), exceptions, and performance counters, and it accepts custom events that you emit yourself, giving you end-to-end visibility into user interactions and background jobs.
| Telemetry Type | Auto-Collected? | Description |
|---|---|---|
| Requests | Yes | Every inbound HTTP request: duration, status, URL, method |
| Dependencies | Yes | Outbound HTTP calls, SQL queries, Service Bus, Redis, Storage |
| Exceptions | Yes | Unhandled exceptions with stack traces and request context |
| Traces | Yes | ILogger output: severity, message, properties |
| Performance Counters | Yes | CPU, memory, GC, thread count from the host OS |
| Custom Events | Manual | Business events such as OrderPlaced, PaymentFailed, FeatureUsed |
| Custom Metrics | Manual | Numeric measurements such as queue depth, cache hit rate |
| Page Views | JS SDK | Browser page load times, user sessions, demographics |
| Availability | Configured | Synthetic ping / multi-step tests from global locations |
02a SDK Setup & Auto-Instrumentation
Getting Application Insights into a .NET 8 service takes a few lines of configuration; the SDK auto-instruments HTTP requests, dependency calls, and exceptions out of the box. The modern approach uses the OpenTelemetry-based Azure Monitor exporter, which gives you vendor-neutral instrumentation with Azure-native export. Always use connection strings (not instrumentation keys) and store them in Key Vault; never hardcode secrets in application code. With the classic SDK you can also enrich every telemetry item with custom properties using ITelemetryInitializer, which helps when filtering by service version, environment, or correlation ID during incident investigation.
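Because a missing or malformed connection string tends to fail silently (telemetry simply never arrives), it can help to check the value at startup. The following is a minimal sketch, not part of any Azure SDK: `ConnectionStringCheck`, `Parse`, and `LooksValid` are hypothetical names, and the rule that `InstrumentationKey` must be present while `IngestionEndpoint` is optional is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not an Azure SDK type): sanity-checks a connection string.
public static class ConnectionStringCheck
{
    // Splits "Key=Value;Key=Value" into a case-insensitive dictionary.
    public static IReadOnlyDictionary<string, string> Parse(string? connectionString)
    {
        var parts = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        var segments = (connectionString ?? "").Split(';', StringSplitOptions.RemoveEmptyEntries);
        foreach (var segment in segments)
        {
            var idx = segment.IndexOf('=');
            if (idx <= 0) continue; // skip segments that have no key
            parts[segment[..idx].Trim()] = segment[(idx + 1)..].Trim();
        }
        return parts;
    }

    // Assumption: a key is required; an endpoint, when present, must be an absolute URI.
    public static bool LooksValid(string? connectionString)
    {
        var parts = Parse(connectionString);
        if (!parts.TryGetValue("InstrumentationKey", out var key) || key.Length == 0)
            return false;
        if (parts.TryGetValue("IngestionEndpoint", out var endpoint))
            return Uri.TryCreate(endpoint, UriKind.Absolute, out _);
        return true;
    }
}
```

A service could call `ConnectionStringCheck.LooksValid(builder.Configuration["ApplicationInsights:ConnectionString"])` before registering telemetry and log a warning when it returns false.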
Connection-String Based Setup (.NET 8)
// Option A NuGet: Azure.Monitor.OpenTelemetry.AspNetCore (modern, OpenTelemetry-based)
// Option B NuGet: Microsoft.ApplicationInsights.AspNetCore (classic SDK)
// Use one option per application, not both.
// -- Option A: Azure Monitor OpenTelemetry (recommended for new projects) --
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddOpenTelemetry()
    .UseAzureMonitor(options =>
    {
        // Connection string from Key Vault / App Settings; never hardcode
        options.ConnectionString =
            builder.Configuration["ApplicationInsights:ConnectionString"];
        // Sampling: see Section 02c
        options.SamplingRatio = 0.1f; // 10% sampling in production
    });
// -- Option B: Classic App Insights SDK --
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.ConnectionString =
        builder.Configuration["ApplicationInsights:ConnectionString"];
    options.EnableAdaptiveSampling = true;
    options.EnableHeartbeat = true;
    options.EnableDebugLogger = false; // Off in production
});
// -- Configuration --
// appsettings.json (keep the secret itself out of this file):
// {
//   "ApplicationInsights": {
//     "ConnectionString": ""
//   }
// }
// On App Service / Functions, set the app setting ApplicationInsights__ConnectionString
// to a Key Vault reference, which the platform resolves at runtime:
//   @Microsoft.KeyVault(VaultName=myVault;SecretName=ai-conn-string)
// Key Vault references are resolved in app settings, not inside appsettings.json.
//
// Connection string format:
// InstrumentationKey=<key>;IngestionEndpoint=https://eastus-8.in.applicationinsights.azure.com/;...
Worker Service / Background Jobs
// NuGet: Microsoft.ApplicationInsights.WorkerService
builder.Services.AddApplicationInsightsTelemetryWorkerService(options =>
{
options.ConnectionString =
builder.Configuration["ApplicationInsights:ConnectionString"];
});
// Azure Functions: auto-configured via host.json
// host.json:
// {
//   "logging": {
//     "applicationInsights": {
//       "samplingSettings": { "isEnabled": true, "maxTelemetryItemsPerSecond": 20 }
//     }
//   }
// }
Enriching All Telemetry with Custom Properties
// ITelemetryInitializer (classic SDK): runs on every telemetry item before it's sent
public class ServiceTelemetryInitializer : ITelemetryInitializer
{
private readonly IHttpContextAccessor _http;
public ServiceTelemetryInitializer(IHttpContextAccessor http) => _http = http;
public void Initialize(ITelemetry telemetry)
{
// Tag every item with service metadata
telemetry.Context.Cloud.RoleName = "orders-api";
telemetry.Context.Cloud.RoleInstance = Environment.MachineName;
// Add custom global dimensions
if (telemetry is ISupportProperties props)
{
props.Properties["ServiceVersion"] = "1.4.2";
props.Properties["Environment"] = Environment.GetEnvironmentVariable("ASPNETCORE_ENVIRONMENT") ?? "Unknown";
// Propagate correlation IDs from inbound request headers
var ctx = _http.HttpContext;
if (ctx is not null)
{
var correlationId = ctx.Request.Headers["X-Correlation-Id"].FirstOrDefault();
if (!string.IsNullOrEmpty(correlationId))
props.Properties["CorrelationId"] = correlationId;
var userId = ctx.User.FindFirst("oid")?.Value;
if (!string.IsNullOrEmpty(userId))
props.Properties["UserId"] = userId;
}
}
}
}
// Register in DI (the initializer depends on IHttpContextAccessor)
services.AddHttpContextAccessor();
services.AddSingleton<ITelemetryInitializer, ServiceTelemetryInitializer>();
02b Custom Telemetry
While auto-instrumentation captures HTTP requests and dependencies, custom telemetry lets you track what matters to your business: events like "OrderPlaced" or "PaymentFailed", custom metrics like queue depth or cache hit rate, and manual dependency tracking for non-HTTP calls. This is where you can correlate technical performance with business outcomes. Use TelemetryClient for custom events and metrics, ILogger with structured properties for contextual traces, and operation holders to group related telemetry items under a single operation ID for end-to-end correlation.
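The grouping idea behind a shared operation ID can be seen with nothing but `System.Diagnostics.Activity`, which .NET uses for distributed tracing context. The sketch below is illustrative only: `CorrelationSketch` and the activity names are hypothetical, and it does not send anything to Application Insights. It shows that an activity started while another is current becomes its child and inherits the same trace ID.

```csharp
using System;
using System.Diagnostics;

// Hypothetical demo type: shows parent/child activities sharing one trace ID.
public static class CorrelationSketch
{
    public static (string ParentTraceId, string ChildTraceId) Run()
    {
        // Force the W3C ID format so TraceId is populated on every runtime.
        using var parent = new Activity("ProcessOrder")
            .SetIdFormat(ActivityIdFormat.W3C)
            .Start();
        string parentTraceId = parent.TraceId.ToString();

        // Started while the parent is Activity.Current, so it becomes a child
        // of the parent and inherits its trace ID.
        using var child = new Activity("ReserveInventory").Start();
        string childTraceId = child.TraceId.ToString();

        return (parentTraceId, childTraceId);
    }
}
```

Calling `CorrelationSketch.Run()` returns two identical trace IDs, which is the property that lets a single filter on the operation ID pull back every item from one logical operation.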
Tracking Custom Events, Metrics & Dependencies
// IInventoryClient is a placeholder for the application's own inventory client type.
public class OrderService(
    TelemetryClient telemetry,
    IInventoryClient inventoryClient,
    ILogger<OrderService> logger)
{
    public async Task<Order> CreateOrderAsync(CreateOrderCommand cmd, CancellationToken ct)
    {
        // -- Custom Event: business milestone --
        telemetry.TrackEvent("OrderCreated", new Dictionary<string, string>
        {
            ["CustomerId"] = cmd.CustomerId.ToString(),
            ["Region"] = cmd.Region,
            ["ItemCount"] = cmd.LineItems.Count.ToString(),
        },
        new Dictionary<string, double>
        {
            ["OrderAmount"] = (double)cmd.TotalAmount,
        });
        // -- Custom Metric: numeric measurement --
        telemetry.TrackMetric("Order.TotalAmount", (double)cmd.TotalAmount,
            new Dictionary<string, string> { ["Region"] = cmd.Region });
        // -- Custom Dependency: any external call --
        var startTime = DateTimeOffset.UtcNow;
        var timer = Stopwatch.StartNew();
        bool success = false;
        try
        {
            var result = await inventoryClient.ReserveItemsAsync(cmd.LineItems, ct);
            success = result.IsSuccess;
            return result.Value;
        }
        catch (Exception ex)
        {
            // -- Exception with extra context --
            telemetry.TrackException(ex, new Dictionary<string, string>
            {
                ["OrderId"] = cmd.OrderId.ToString(),
                ["CustomerId"] = cmd.CustomerId.ToString(),
                ["Operation"] = "CreateOrder",
            });
            throw;
        }
        finally
        {
            // Runs on both success and failure, so the dependency is always recorded
            timer.Stop();
            telemetry.TrackDependency(
                dependencyTypeName: "gRPC",
                target: "inventory-api",
                dependencyName: "InventoryClient.ReserveItems",
                data: $"CustomerId={cmd.CustomerId}",
                startTime: startTime,
                duration: timer.Elapsed,
                resultCode: success ? "200" : "500",
                success: success);
        }
    }
}
// -- Structured logging: ILogger feeds into App Insights traces --
logger.LogInformation(
    "Order {OrderId} created for customer {CustomerId}, amount {Amount:C}",
    order.Id, order.CustomerId, order.TotalAmount);
// -- Using scopes for correlated log groups --
using (logger.BeginScope(new Dictionary<string, object>
{
    ["OrderId"] = order.Id,
    ["CustomerId"] = order.CustomerId,
    ["RequestId"] = Activity.Current?.TraceId.ToString() ?? ""
}))
{
    logger.LogInformation("Starting payment processing");
    await ProcessPaymentAsync(order, ct);
    logger.LogInformation("Payment processed successfully");
}
Operation Tracking: Grouping Related Telemetry
// Use IOperationHolder to group related telemetry items
// All items inside the using block share the same operation ID
public async Task ProcessMessageAsync(ServiceBusReceivedMessage message)
{
using var operation = telemetry.StartOperation<RequestTelemetry>(
"ServiceBus.ProcessOrder");
operation.Telemetry.Properties["MessageId"] = message.MessageId;
operation.Telemetry.Properties["QueueName"] = "orders";
try
{
var order = JsonSerializer.Deserialize<OrderCreatedEvent>(message.Body);
await ProcessOrderInternalAsync(order!);
operation.Telemetry.Success = true;
operation.Telemetry.ResponseCode = "200";
}
catch (Exception ex)
{
operation.Telemetry.Success = false;
operation.Telemetry.ResponseCode = "500";
telemetry.TrackException(ex);
throw;
}
}
02c Sampling & Cost Control
Application Insights charges per GB ingested. For high-traffic services, sampling is essential: it reduces data volume while preserving statistical accuracy and keeping correlated telemetry together (all spans of one trace are either all sampled or all dropped).
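The "all kept or all dropped" property follows from making the sampling decision a deterministic function of the operation ID rather than a random draw per item. The sketch below illustrates that idea under stated assumptions: `SamplingSketch` and `IsSampledIn` are hypothetical names, and the FNV-1a hash is chosen only because it is stable across processes; it is not the hash the Application Insights SDK uses.

```csharp
using System;

// Hypothetical demo type: deterministic, per-operation sampling decision.
public static class SamplingSketch
{
    // FNV-1a, a stable 32-bit hash (string.GetHashCode is randomized per process).
    // Illustrative only; the real SDK uses its own hashing scheme.
    private static uint StableHash(string value)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in value)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return hash;
        }
    }

    // Maps the operation ID to a score in [0, 100) and keeps the operation when
    // the score is below the sampling percentage. Every telemetry item carrying
    // the same operation ID therefore gets the same decision.
    public static bool IsSampledIn(string operationId, double samplingPercentage)
    {
        double score = StableHash(operationId) / (uint.MaxValue + 1.0) * 100.0;
        return score < samplingPercentage;
    }
}
```

With this scheme a request, its dependencies, and its exceptions are all evaluated against the same score, so a 10% rate keeps roughly one in ten operations in full rather than a tenth of the items from every operation.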
| Sampling Type | Where Configured | How It Works | Best For |
|---|---|---|---|
| Adaptive Sampling | SDK (default on) | Auto-adjusts rate to stay under target events/sec | Variable traffic; production default |
| Fixed-Rate Sampling | SDK config | Fixed % of operations sampled; predictable volume | Predictable billing, A/B comparison |
| Ingestion Sampling | App Insights portal | Drops data after arrival; no SDK change needed | Quick cost reduction without redeployment |
| OpenTelemetry Sampling | OTel SDK | Head-based at trace level in the SDK; tail-based sampling needs an OpenTelemetry Collector | Modern OTel pipelines |
// Adaptive sampling with default settings: recommended for most services
services.AddApplicationInsightsTelemetry(options =>
{
    options.EnableAdaptiveSampling = true;
});
// -- Custom adaptive settings (use INSTEAD of the registration above) --
// Disable the built-in processor first, otherwise telemetry is sampled twice.
services.AddApplicationInsightsTelemetry(options =>
{
    options.EnableAdaptiveSampling = false;
});
services.Configure<TelemetryConfiguration>(config =>
{
    var settings = new SamplingPercentageEstimatorSettings
    {
        MaxTelemetryItemsPerSecond = 5, // Target: max 5 events/sec per instance
        MinSamplingPercentage = 0.1, // Never sample less than 0.1%
        MaxSamplingPercentage = 100, // Full sampling when low traffic
        EvaluationInterval = TimeSpan.FromSeconds(15),
        SamplingPercentageDecreaseTimeout = TimeSpan.FromMinutes(2),
        SamplingPercentageIncreaseTimeout = TimeSpan.FromMinutes(15),
    };
    // Build() returns void, so let the chain builder wire up the next processor
    var chain = config.DefaultTelemetrySink.TelemetryProcessorChainBuilder;
    chain.UseAdaptiveSampling(settings, callback: null);
    chain.Build();
});
// -- Fixed-rate sampling: predictable volume --
services.Configure<TelemetryConfiguration>(config =>
{
config.TelemetryProcessorChainBuilder
.UseSampling(samplingPercentage: 10) // Sample 10% of operations
.Build();
});
// -- Drop specific telemetry entirely, before sampling runs --
public class ExcludeHealthCheckFilter : ITelemetryProcessor
{
private readonly ITelemetryProcessor _next;
public ExcludeHealthCheckFilter(ITelemetryProcessor next) => _next = next;
public void Process(ITelemetry item)
{
// Don't send health check telemetry to App Insights at all
if (item is RequestTelemetry req &&
req.Url?.AbsolutePath.StartsWith("/health") == true)
return;
_next.Process(item);
}
}
services.Configure<TelemetryConfiguration>(config =>
{
config.TelemetryProcessorChainBuilder
.Use(next => new ExcludeHealthCheckFilter(next))
.UseSampling(10)
.Build();
});
02d Availability Tests
Availability tests run synthetic requests against your endpoints from Azure global PoP locations on a schedule, detecting outages from the user's perspective, often before internal monitoring or real users report them.
| Test Type | Description | Best For |
|---|---|---|
| Standard (URL ping) | Simple HTTP GET/POST to one URL; checks status code and response time | API health endpoints, uptime SLA monitoring |
| Multi-step (TrackAvailability) | Custom code simulating a user journey: login, search, checkout | Critical user flows, end-to-end smoke tests |
| Custom TrackAvailability | Emit availability telemetry from your own infrastructure | Private endpoints not reachable from Azure PoPs |
// Custom availability test via Azure Function (for private endpoints)
[FunctionName("AvailabilityTest")]
public async Task Run([TimerTrigger("0 */5 * * * *")] TimerInfo timer)
{
var testName = "Orders API - Create Order Flow";
var runLocation = "East US";
var startTime = DateTimeOffset.UtcNow;
var timer2 = Stopwatch.StartNew();
bool success = false;
string message = "";
try
{
// Step 1: Authenticate
var token = await GetTestTokenAsync();
// Step 2: Create a test order
_httpClient.DefaultRequestHeaders.Authorization =
new AuthenticationHeaderValue("Bearer", token);
var response = await _httpClient.PostAsJsonAsync(
"/api/v1/orders",
new { customerId = TestCustomerId, lineItems = TestLineItems });
success = response.IsSuccessStatusCode;
message = $"Status: {(int)response.StatusCode}";
// Step 3: Verify the order exists
if (success)
{
var order = await response.Content.ReadFromJsonAsync<OrderResponse>();
var getResponse = await _httpClient.GetAsync($"/api/v1/orders/{order!.OrderId}");
success = getResponse.IsSuccessStatusCode;
message += $" | GET: {(int)getResponse.StatusCode}";
}
}
catch (Exception ex)
{
success = false;
message = ex.Message;
}
finally
{
timer2.Stop();
_telemetry.TrackAvailability(
name: testName,
timeStamp: startTime,
duration: timer2.Elapsed,
runLocation: runLocation,
success: success,
message: message);
}
}
02e Application Map & Live Metrics
Application Map automatically renders your service topology: nodes for each component, edges for dependency calls, and failure rates and latency on each edge. Live Metrics streams telemetry in near real time (about one second of latency), which is essential during deployments and incident response.
// -- App Insights KQL: failure rate by operation in the last 1h --
requests
| where timestamp > ago(1h)
| summarize
Total = count(),
Failed = countif(success == false),
P95_ms = percentile(duration, 95),
P99_ms = percentile(duration, 99)
by name
| extend FailureRate = round(100.0 * Failed / Total, 2)
| where Total > 10
| order by FailureRate desc
// -- Slow dependency calls (SQL > 1s) --
dependencies
| where timestamp > ago(1h)
| where type == "SQL"
| where duration > 1000
| project timestamp, name, data, duration, success, resultCode
| order by duration desc
| take 50
// -- Top exceptions in the last 24h --
exceptions
| where timestamp > ago(24h)
| summarize Count = count() by type, outerMessage
| order by Count desc
| take 20
// -- User journey: trace one operation end-to-end --
let opId = "abc123def456";
union requests, dependencies, exceptions, traces
| where operation_Id == opId
| project timestamp, itemType, name, duration, success, message, type
| order by timestamp asc
03 Log Analytics
Log Analytics Workspace is Azure's centralised log aggregation and query engine. Every Azure service can ship its diagnostic logs here. Your applications send structured logs via the App Insights SDK or the OTel OTLP exporter. Queries are written in KQL (Kusto Query Language), an expressive, pipe-based query language purpose-built for log analytics.
| Log Source | How to Connect | Key Tables |
|---|---|---|
| Azure App Service | Diagnostic Settings to Log Analytics | AppServiceHTTPLogs, AppServiceConsoleLogs |
| Azure Functions | Diagnostic Settings + AI SDK | FunctionAppLogs, requests, traces |
| Azure Kubernetes Service | Container Insights add-on | ContainerLog, KubePodInventory, KubeEvents |
| Azure API Management | Diagnostic Settings | ApiManagementGatewayLogs |
| Azure Service Bus | Diagnostic Settings | AzureDiagnostics (ResourceType=NAMESPACES) |
| Azure Key Vault | Diagnostic Settings | AzureDiagnostics (ResourceType=VAULTS) |
| Application Insights | Linked workspace (auto) | requests, dependencies, exceptions, traces (named AppRequests, AppDependencies, AppExceptions, AppTraces when queried from the workspace) |
| Custom App | OTel OTLP exporter / DCR API | Custom table or AppTraces |
| Azure Activity Log | Export to workspace | AzureActivity |
| VM / Arc | Azure Monitor Agent | Syslog, SecurityEvent, Event |
03a Workspace Design
How you structure your Log Analytics workspaces determines your query capabilities, access control boundaries, and cost efficiency. The recommended pattern for most organizations is a centralized workspace per environment with table-level RBAC; this enables cross-service correlation queries while maintaining security isolation. Always link Application Insights to a workspace so you can join APM data with platform logs in a single KQL query. Consider commitment tiers once you exceed 100 GB/day ingestion, as they offer up to 30% savings over pay-as-you-go pricing.
| Decision | Recommendation | Reason |
|---|---|---|
| Workspaces per environment | One per environment (dev/staging/prod) | Isolation: no dev noise in prod queries |
| Workspaces per region | One per region if data residency required | Compliance: data stays in region |
| Centralised vs per-team | Central workspace + table-level RBAC | Cost efficiency + cross-service correlation |
| Retention period | 30 days hot, up to 730 days archive | Balance cost vs forensics requirement |
| Commitment tier | Use commitment tier at >100 GB/day | Up to 30% discount over pay-as-you-go |
| Linked App Insights | Always link AI to a workspace | Unified queries across AI + platform logs |
# Create Log Analytics Workspace
az monitor log-analytics workspace create \
--resource-group myRG \
--workspace-name myPlatformWorkspace \
--location eastus \
--retention-time 90 \
--sku PerGB2018
# Get workspace ID (for App Insights linking)
WORKSPACE_ID=$(az monitor log-analytics workspace show \
--resource-group myRG \
--workspace-name myPlatformWorkspace \
--query id --output tsv)
# Create Application Insights linked to the workspace
az monitor app-insights component create \
--app myOrdersApi \
--resource-group myRG \
--location eastus \
--workspace "$WORKSPACE_ID" \
--kind web
# Enable diagnostic settings: Service Bus to workspace
az monitor diagnostic-settings create \
--resource "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ServiceBus/namespaces/<ns>" \
--name "sb-diagnostics" \
--workspace "$WORKSPACE_ID" \
--logs '[{"category":"OperationalLogs","enabled":true},{"category":"VNetAndIPFilteringLogs","enabled":true}]' \
--metrics '[{"category":"AllMetrics","enabled":true}]'
03b KQL - Core Queries
KQL (Kusto Query Language) is the query language for Log Analytics and Application Insights; it is purpose-built for interactively exploring large volumes of telemetry. Unlike SQL, KQL uses a pipe-based syntax where each operator transforms the result set flowing through it, making complex queries readable and composable. Always start with a time filter to limit the data scanned, prefer has over contains when matching whole terms (has can use the term index, so it is faster), and use summarize with bin() for time-series aggregations that render directly as charts.
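For readers who think in C#, the `summarize ... by bin(TimeGenerated, 5m)` pattern is essentially "round each timestamp down to the start of its bucket, then group". The sketch below reproduces that with LINQ. `BinSketch`, `Bin`, and `CountByBin` are hypothetical names, and the rounding is only claimed to match KQL for ordinary bucket sizes such as 5 minutes or 1 hour.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo type: a LINQ analogue of KQL's summarize count() by bin(...).
public static class BinSketch
{
    // Rounds a timestamp down to the start of its bucket, like bin(t, size).
    public static DateTime Bin(DateTime timestamp, TimeSpan size) =>
        new DateTime(timestamp.Ticks - (timestamp.Ticks % size.Ticks), timestamp.Kind);

    // Analogue of: | summarize Count = count() by bin(TimeGenerated, size)
    public static IReadOnlyList<(DateTime Bucket, int Count)> CountByBin(
        IEnumerable<DateTime> timestamps, TimeSpan size) =>
        timestamps
            .GroupBy(t => Bin(t, size))
            .OrderBy(g => g.Key)
            .Select(g => (Bucket: g.Key, Count: g.Count()))
            .ToList();
}
```

Each pipe stage in a KQL query plays the role of one LINQ operator here: the input flows left to right, and every step receives the previous step's result set.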
KQL Cheat Sheet: Essential Operators
// -- Filtering --
TableName
| where TimeGenerated > ago(1h) // Time filter (always first!)
| where Level == "Error" // Exact match (case-sensitive)
| where Message contains "timeout" // Case-insensitive substring
| where Message has "OrderId" // Whole-term match; faster than contains
| where StatusCode between (500 .. 599) // Numeric range
| where isnotempty(CorrelationId) // Not null/empty
| where tostring(Properties.region) in ("EU", "US") // Value in list (cast dynamic to string first)
// -- Projection --
| project TimeGenerated, Level, Message, Properties // Select columns
| project-away TenantId, SubscriptionId // Drop columns
| extend DurationSec = Duration / 1000.0 // Add computed column
| parse Message with "OrderId=" OrderId " amount=" Amount:double " region=" Region
// -- Aggregation --
| summarize
    Count = count(),
    ErrorCount = countif(Level == "Error"),
    AvgDuration = avg(Duration),
    P95Duration = percentile(Duration, 95),
    MaxDuration = max(Duration)
    by bin(TimeGenerated, 5m), ServiceName
// -- Sorting & Limiting --
| order by TimeGenerated desc
| top 100 by Duration desc // Top N by a column
// -- Joining tables --
requests
| join kind=leftouter (
    exceptions
    | project operation_Id, exceptionType = type, exceptionMsg = outerMessage
) on operation_Id
// -- String operations --
| extend OrderId = extract("OrderId=([a-f0-9-]+)", 1, Message)
| extend Domain = tostring(split(Email, "@")[1])
| where ServiceName startswith "order"
| where Url matches regex @"/api/v[0-9]+/orders/[a-f0-9-]+"
// -- Time operations --
| extend HourOfDay = hourofday(TimeGenerated)
| extend DayOfWeek = dayofweek(TimeGenerated)
| where TimeGenerated between (datetime(2026-05-01) .. datetime(2026-05-08))
Essential Production Queries
// Note: the Application Insights tables (requests, dependencies, exceptions)
// use the timestamp column; workspace tables such as ContainerLog use TimeGenerated.
// -- Error rate trend (5-min buckets) --
requests
| where timestamp > ago(3h)
| summarize
    Total = count(),
    Failed = countif(success == false)
    by bin(timestamp, 5m), cloud_RoleName
| extend ErrorRate = round(100.0 * Failed / Total, 2)
| render timechart
// -- P50/P95/P99 latency by endpoint --
requests
| where timestamp > ago(1h)
| summarize
    P50 = percentile(duration, 50),
    P95 = percentile(duration, 95),
    P99 = percentile(duration, 99),
    Count = count()
    by name
| where Count > 100
| order by P99 desc
// -- Dependency failures: which external calls are breaking? --
dependencies
| where timestamp > ago(1h)
| where success == false
| summarize
    FailureCount = count(),
    AvgDuration = avg(duration),
    Targets = make_set(target)
    by type, name
| order by FailureCount desc
// -- Exceptions: top 10 with sample messages --
exceptions
| where timestamp > ago(24h)
| summarize
    Count = count(),
    Sample = any(outerMessage),
    LastSeen = max(timestamp)
    by type
| top 10 by Count desc
// -- Slow SQL queries --
dependencies
| where timestamp > ago(1h)
| where type == "SQL"
| where duration > 500
| project timestamp, name, data, duration, resultCode
| order by duration desc
| take 20
// -- Container restart loop detection --
ContainerLog
| where TimeGenerated > ago(1h)
| where LogEntry contains "OOMKilled" or LogEntry contains "CrashLoopBackOff"
| summarize Restarts = count() by ContainerID, _ResourceId
| where Restarts > 3
| order by Restarts desc
03c KQL - Advanced Patterns
Advanced KQL patterns cover anomaly detection, cross-workspace correlation, funnel analysis, and dynamic JSON parsing; these are the queries that typically back production alert rules and executive dashboards. Error spike detection compares current error counts against a rolling baseline to catch regressions without hardcoded thresholds. Cross-workspace queries let you correlate security events with application failures, while funnel analysis tracks user drop-off through multi-step business flows.
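The arithmetic behind spike detection and funnel conversion is small enough to state directly. The sketch below restates it as pure functions so the thresholds can be unit-tested outside of KQL. `AlertMath` and its members are hypothetical names, and the choices to return null for a zero baseline and 0 for an empty previous funnel step are assumptions of this sketch rather than behaviour taken from the queries.

```csharp
using System;

// Hypothetical demo type: the threshold and ratio arithmetic used by the queries.
public static class AlertMath
{
    // Spike rule: more than twice the baseline AND above a minimum error count,
    // so that a jump from 1 to 3 errors does not page anyone.
    public static bool IsErrorSpike(long currentErrors, long baselineErrors, long minErrors = 10) =>
        currentErrors > baselineErrors * 2 && currentErrors > minErrors;

    // Percentage increase over the baseline; null when the baseline is zero
    // (assumption: an undefined ratio is reported as "no value").
    public static double? PercentIncrease(long currentErrors, long baselineErrors) =>
        baselineErrors == 0
            ? (double?)null
            : Math.Round(100.0 * (currentErrors - baselineErrors) / baselineErrors, 1);

    // Funnel conversion between two steps, as a percentage with one decimal.
    public static double ConversionPercent(long usersAtStep, long usersAtPreviousStep) =>
        usersAtPreviousStep == 0
            ? 0
            : Math.Round(100.0 * usersAtStep / usersAtPreviousStep, 1);
}
```

For example, 25 errors against a baseline of 10 is a spike with a 150% increase, while 15 against 10 is not, because it fails the "more than double" condition.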
// -- Alerting query: error spike detection (> 2x baseline) --
// KQL has no cross join, so the single-value results are captured with toscalar().
let baselineErrors = toscalar(
    requests
    | where timestamp between (ago(2h) .. ago(1h))
    | summarize countif(success == false));
let currentErrors = toscalar(
    requests
    | where timestamp > ago(1h)
    | summarize countif(success == false));
print CurrentErrors = currentErrors, BaselineErrors = baselineErrors
| where CurrentErrors > BaselineErrors * 2 and CurrentErrors > 10
| extend Increase = round(100.0 * (CurrentErrors - BaselineErrors) / BaselineErrors, 1)
// -- User impact analysis: how many users hit errors? --
requests
| where timestamp > ago(1h)
| where success == false
| summarize
    FailedRequests = count(),
    AffectedUsers = dcount(user_Id),
    AffectedOps = dcount(operation_Id)
    by name, resultCode
| order by AffectedUsers desc
// -- Funnel analysis: order creation drop-off --
let s1 = toscalar(customEvents | where name == "CartViewed" | summarize dcount(user_Id));
let s2 = toscalar(customEvents | where name == "CheckoutStarted" | summarize dcount(user_Id));
let s3 = toscalar(customEvents | where name == "OrderSubmitted" | summarize dcount(user_Id));
let s4 = toscalar(customEvents | where name == "OrderConfirmed" | summarize dcount(user_Id));
print
    CartViewed = s1,
    CheckoutStarted = s2,
    OrderSubmitted = s3,
    OrderConfirmed = s4
| extend
    CheckoutConversion = round(100.0 * CheckoutStarted / CartViewed, 1),
    SubmitConversion = round(100.0 * OrderSubmitted / CheckoutStarted, 1),
    FinalConversion = round(100.0 * OrderConfirmed / CartViewed, 1)
// -- Cross-workspace query: correlate App Insights + Security logs --
workspace("SecurityWorkspace").SecurityEvent
| where TimeGenerated > ago(1h)
| where EventID == 4625 // Failed login
| join kind=inner (
    app("OrdersAppInsights").requests
    | where success == false
    | project operation_Id, clientIP = client_IP
) on $left.IpAddress == $right.clientIP
| project TimeGenerated, IpAddress, Account, operation_Id
// -- Dynamic columns from JSON properties --
// String properties land in customDimensions; numeric metrics passed to
// TrackEvent land in customMeasurements.
customEvents
| where name == "OrderCreated"
| extend
    CustomerId = tostring(customDimensions["CustomerId"]),
    Amount = todouble(customMeasurements["OrderAmount"]),
    Region = tostring(customDimensions["Region"])
| summarize Revenue = sum(Amount), Orders = count() by Region, bin(timestamp, 1h)
| render timechart
03d Key Log Tables Reference
Every Azure service and Application Insights telemetry type maps to a specific table in Log Analytics, and knowing which table to query saves time during incident response. The requests and dependencies tables are your go-to for API performance, while AzureDiagnostics is the catch-all for platform resource logs filtered by ResourceType. Use this reference to jump straight to the right table when debugging production issues instead of guessing table names under pressure.
| Table | Source | Key Columns |
|---|---|---|
| requests | App Insights | name, duration, success, resultCode, url, operation_Id, cloud_RoleName |
| dependencies | App Insights | name, type, target, data, duration, success, resultCode, operation_Id |
| exceptions | App Insights | type, outerMessage, innermostMessage, stack, operation_Id, cloud_RoleName |
| traces | App Insights | message, severityLevel, customDimensions, operation_Id, cloud_RoleName |
| customEvents | App Insights | name, customDimensions, customMeasurements, operation_Id, user_Id |
| customMetrics | App Insights | name, value, valueCount, valueSum, valueMin, valueMax |
| availabilityResults | App Insights | name, success, duration, location, message, runLocation |
| AzureDiagnostics | Azure Resources | ResourceType, OperationName, ResultType, Level, CallerIpAddress |
| AzureActivity | Azure RBAC / Control | OperationName, Caller, ResourceGroup, ActivityStatus, Level |
| ContainerLog | AKS / Container Insights | ContainerID, LogEntry, LogEntrySource, _ResourceId |
| KubePodInventory | AKS | PodName, Namespace, PodStatus, ContainerStatusReason, Node |
| KubeEvents | AKS | Name, Namespace, Reason, Message, KubeEventType |
| AppServiceHTTPLogs | App Service | CsMethod, CsUriStem, ScStatus, TimeTaken, CIp |
| FunctionAppLogs | Azure Functions | HostInstanceId, Message, ExceptionMessage, FunctionName, Level |
| ApiManagementGatewayLogs | APIM | ApiId, OperationId, ResponseCode, TotalTime, ClientIp |
| SigninLogs | Entra ID | UserPrincipalName, IPAddress, ResultType, ConditionalAccessStatus |
04 Azure Monitor
Azure Monitor is the platform-level observability service: it automatically collects metrics from every Azure resource (no agent needed), evaluates alert rules, triggers action groups, and aggregates data into workbooks and dashboards. It is the umbrella that App Insights and Log Analytics sit within.
| Azure Monitor Feature | Description |
|---|---|
| Platform Metrics | Auto-collected numeric metrics from every Azure resource: CPU, requests, queue depth, etc. |
| Custom Metrics | Emit your own time-series from apps via OTel or App Insights SDK |
| Alert Rules | Evaluate metric/log conditions and fire on threshold breach |
| Action Groups | Notifications and automation triggered by alerts: email, SMS, webhook, PagerDuty, Logic App |
| Workbooks | Interactive parameterised reports mixing KQL, metrics, and markdown |
| Dashboards | Pinnable metric charts and tiles for NOC-style displays |
| Autoscale | Scale Azure resources (App Service, VMSS) based on metric thresholds |
| Change Analysis | Track infrastructure changes correlated with incidents |
| Service Health | Azure platform incidents affecting your specific resources |
04a Metrics & Dimensions
Azure Monitor automatically collects platform metrics from every resource (CPU, memory, request counts, queue depths) with no agent or SDK required. These numeric time-series are stored for 93 days, typically at one-minute granularity, making them well suited to near-real-time dashboards and alert rules. For application-specific measurements, you can emit custom metrics via the OpenTelemetry Meter API using counters, histograms, and gauges with dimensional tags that enable filtering and grouping in metrics explorer.
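Before wiring an exporter, it can be useful to confirm locally that an instrument actually emits the measurements and tags you expect. The sketch below does that with `MeterListener` from `System.Diagnostics.Metrics`, with no Azure dependency. `MeterListenerSketch`, the meter name `Demo.Orders`, and the instrument name are hypothetical and chosen only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical demo type: observes counter measurements in-process.
public static class MeterListenerSketch
{
    // Records two measurements and returns the total the listener observed.
    public static long Run()
    {
        using var meter = new Meter("Demo.Orders", "1.0.0"); // hypothetical meter name
        var ordersCreated = meter.CreateCounter<long>("orders.created", unit: "orders");

        long total = 0;
        using var listener = new MeterListener();
        // Subscribe only to instruments published by our own meter.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == "Demo.Orders")
                l.EnableMeasurementEvents(instrument);
        };
        // Invoked synchronously for every Add() on an enabled long instrument.
        listener.SetMeasurementEventCallback<long>(
            (instrument, value, tags, state) => total += value);
        listener.Start();

        // Tags become dimensions once the metric is exported.
        ordersCreated.Add(1, new KeyValuePair<string, object?>("region", "EU"));
        ordersCreated.Add(2, new KeyValuePair<string, object?>("region", "US"));
        return total;
    }
}
```

The same listener approach works in unit tests, where asserting on recorded values is cheaper than inspecting a metrics backend.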
Key Metrics Per Service
| Service | Critical Metrics to Monitor |
|---|---|
| App Service / Functions | CpuPercentage, MemoryWorkingSet, HttpResponseTime, HttpServerErrors, Requests |
| Azure SQL / SQL MI | cpu_percent, dtu_consumption_percent, deadlock, connection_failed, storage_percent |
| Azure Service Bus | ActiveMessages, DeadletteredMessages, IncomingMessages, ThrottledRequests, ServerErrors |
| Azure Event Hubs | IncomingMessages, OutgoingMessages, ThrottledRequests, IncomingBytes, CapturedMessages |
| Azure Event Grid | PublishSuccessCount, DeliverySuccessCount, DeadLetteredCount, DroppedEventCount |
| Azure Key Vault | ServiceApiHit, ServiceApiLatency, ServiceApiResult (failures) |
| Azure API Management | TotalRequests, SuccessfulRequests, FailedRequests, Duration, Capacity |
| Azure Container Apps | Replicas, CpuUsage, MemoryUsage, RequestCount, ResponseTime |
| AKS | node_cpu_usage_percentage, node_memory_working_set_percentage, kube_pod_status_phase |
| Azure Cosmos DB | TotalRequests, TotalRequestUnits, ServerSideLatency, NormalizedRUConsumption |
Custom Metrics via OpenTelemetry (.NET 8)
using System.Diagnostics; // Stopwatch (used in the handler below)
using System.Diagnostics.Metrics;
// Define meters and instruments (static: created once)
public static class OrdersMetrics
{
private static readonly Meter Meter = new("Orders.Api", "1.0.0");
// Counter: monotonically increasing
public static readonly Counter<long> OrdersCreated =
Meter.CreateCounter<long>(
name: "orders.created.total",
unit: "orders",
description: "Total number of orders created");
// Histogram: distribution of values (latency, sizes)
public static readonly Histogram<double> OrderProcessingDuration =
Meter.CreateHistogram<double>(
name: "orders.processing.duration",
unit: "ms",
description: "Time to process an order end-to-end");
// ObservableGauge: current value read from a callback
public static readonly ObservableGauge<int> PendingOrders =
Meter.CreateObservableGauge<int>(
name: "orders.pending.count",
observeValue: () => OrderRepository.GetPendingCount(),
unit: "orders",
description: "Current number of pending orders");
// UpDownCounter: can increase and decrease
public static readonly UpDownCounter<int> ActiveConnections =
Meter.CreateUpDownCounter<int>(
name: "orders.active.connections",
unit: "connections");
}
// Usage in command handler
public async Task<Result<OrderResponse>> Handle(CreateOrderCommand cmd, CancellationToken ct)
{
var sw = Stopwatch.StartNew();
try
{
var order = Order.Create(cmd.CustomerId, cmd.LineItems);
await _repository.SaveAsync(order, ct);
// Record with dimensions (tags)
OrdersMetrics.OrdersCreated.Add(1,
new KeyValuePair<string, object?>("region", cmd.Region),
new KeyValuePair<string, object?>("channel", cmd.Channel),
new KeyValuePair<string, object?>("priority", cmd.IsPriority ? "high" : "normal"));
return Result.Success(OrderResponse.From(order));
}
finally
{
sw.Stop();
OrdersMetrics.OrderProcessingDuration.Record(sw.Elapsed.TotalMilliseconds,
new KeyValuePair<string, object?>("region", cmd.Region));
}
}
// Register meters with OTel
builder.Services.AddOpenTelemetry()
.WithMetrics(m => m
.AddMeter("Orders.Api")
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddRuntimeInstrumentation() // GC, thread pool, heap
.AddProcessInstrumentation() // CPU, memory
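.AddOtlpExporter() // → OTLP endpoint, e.g. an OpenTelemetry Collector
.AddPrometheusExporter()); // exposes metrics for Prometheus scraping
// Map the scrape endpoint (/metrics by default) once the app is built:
// app.MapPrometheusScrapingEndpoint();
04b Alert Rules & Action Groups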
Alert rules are the automated sentinels of your production environment: they continuously evaluate conditions against your metrics and logs and fire notifications when thresholds are breached. Action groups define who gets notified and how: email, SMS, a webhook to PagerDuty, or a Logic App that runs automated remediation. Effective alerting combines metric alerts for near-real-time threshold detection with log search alerts for complex pattern matching, and keeps severity levels aligned with actual user impact to prevent alert fatigue.
| Alert Type | Condition Evaluated | Best For |
|---|---|---|
| Metric Alert | Numeric metric crosses threshold | CPU > 80%, error rate > 5%, queue depth > 1000 |
| Log Search Alert | KQL query returns rows | DLQ messages, specific error patterns, security events |
| Activity Log Alert | Azure control-plane event occurs | Resource deleted, role assigned, policy violated |
| Smart Detection | AI-detected anomalies in App Insights | Sudden increase in exceptions, degraded response time |
| Resource Health Alert | Azure resource enters unhealthy state | VM unavailable, SQL inaccessible, App Service down |
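As a rough mental model of a metric alert, the sketch below (Python, standard library only) averages the samples in a look-back window on every evaluation and fires when that average crosses the threshold. The function name, the sample data, and the "no data means not firing" rule are assumptions for illustration. Real Azure Monitor evaluation offers more options, such as other aggregation types and dynamic thresholds.

```python
from statistics import mean


def evaluate_metric_alert(samples, now, window_minutes, threshold):
    """Return True when the average over the look-back window exceeds the threshold.

    samples: list of (minute, value) pairs; now: the current minute.
    """
    window = [v for t, v in samples if now - window_minutes < t <= now]
    if not window:
        return False  # assumption: this toy treats "no data" as "not firing"
    return mean(window) > threshold


# One CPU sample per minute: a short spike at minute 3, sustained load from minute 6
cpu = [(1, 40), (2, 45), (3, 95), (4, 50), (5, 42),
       (6, 88), (7, 91), (8, 93), (9, 90), (10, 94)]

# Evaluate once per minute with a 5-minute window and an 80% threshold
fired_at = [now for now in range(5, 11)
            if evaluate_metric_alert(cpu, now, window_minutes=5, threshold=80)]
print(fired_at)
```

The single spike at minute 3 never fires the alert because the window average absorbs it. Only the sustained load does. That is the trade-off behind the window size: longer windows suppress noise but delay detection.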
Create Alert Rule via CLI
# -- Step 1: Create Action Group (who gets notified) --
az monitor action-group create \
--resource-group myRG \
--name platform-oncall \
--short-name oncall \
--action email oncall-email ops-team@contoso.com \
--action webhook pagerduty https://events.pagerduty.com/integration/<key>/enqueue
# -- Step 2: Metric Alert: Service Bus DLQ > 0 --
az monitor metrics alert create \
--name "ServiceBus-DLQ-Alert" \
--resource-group myRG \
--scopes "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.ServiceBus/namespaces/<ns>" \
--condition "avg DeadletteredMessages > 0" \
--window-size 5m \
--evaluation-frequency 1m \
--severity 2 \
--action "/subscriptions/<sub>/resourceGroups/myRG/providers/microsoft.insights/actionGroups/platform-oncall" \
--description "Messages in Service Bus dead-letter queue"
# -- Step 3: Log Alert: 5xx error spike --
# The query is bound to a placeholder name (FailRateQuery) that --condition refers to.
az monitor scheduled-query create \
--name "High-Error-Rate-Alert" \
--resource-group myRG \
--scopes "<app-insights-resource-id>" \
--condition "count 'FailRateQuery' > 0" \
--condition-query FailRateQuery="requests | where timestamp > ago(5m) | summarize FailRate = countif(success==false)*100.0/count() | where FailRate > 5" \
--window-size 5m \
--evaluation-frequency 1m \
--severity 1 \
--action-groups "/subscriptions/<sub>/resourceGroups/myRG/providers/microsoft.insights/actionGroups/platform-oncall"
Production Alert Rules Starter Pack
| Alert Name | Metric / Query | Threshold | Severity |
|---|---|---|---|
| High Error Rate | requests → failure rate | > 5% over 5 min | Sev 1 |
| P99 Latency Spike | requests → P99 duration | > 3000 ms | Sev 2 |
| DLQ Messages | DeadletteredMessages | > 0 | Sev 2 |
| CPU Sustained High | CpuPercentage | > 85% for 15 min | Sev 2 |
| Memory Near Limit | MemoryWorkingSet | > 90% of limit | Sev 2 |
| Availability Failed | availabilityResults → success==false | Any failure | Sev 1 |
| Key Vault Unauthorized | AzureDiagnostics → ResultType!=Success | Any | Sev 1 |
| Pod CrashLoopBackOff | KubeEvents → Reason==BackOff | Any | Sev 2 |
| Storage Near Capacity | UsedCapacity | > 80% of limit | Sev 3 |
| Exception Spike | exceptions → count vs 1h baseline | > 2x baseline | Sev 2 |
04c Workbooks & Dashboards
Workbooks are interactive, parameterised reports that combine KQL queries, Azure Metrics, markdown text, and ARM data into a single document. They are ideal for SLO reports, incident postmortems, capacity planning, and team-facing health dashboards.
// Workbook ARM template structure (simplified)
{
"type": "microsoft.insights/workbooks",
"properties": {
"displayName": "Orders API - SLO Dashboard",
"serializedData": {
"version": "Notebook/1.0",
"items": [
{
"type": 9, // Parameters
"content": {
"parameters": [
{
"id": "timeRange",
"version": "KqlParameterItem/1.0",
"name": "TimeRange",
"type": 4,
"value": { "durationMs": 3600000 }
},
{
"id": "service",
"name": "ServiceName",
"type": 2, // Drop-down
"query": "requests | summarize by cloud_RoleName"
}
]
}
},
{
"type": 3, // Query item (chart)
"content": {
"query": "requests | where cloud_RoleName == '{ServiceName}' | summarize ErrorRate = countif(success==false)*100.0/count() by bin(timestamp, 5m) | render timechart",
"visualization": "timechart",
"title": "Error Rate % - {ServiceName}"
}
}
]
}
}
}
04d Diagnostic Settings
Diagnostic settings control where an Azure resource sends its platform logs and metrics. Every production resource should have diagnostic settings routing to a Log Analytics workspace.
// Bicep: apply diagnostic settings to a resource via a module
module diagnostics 'modules/diagnostics.bicep' = {
  name: 'orders-api-diagnostics'
  params: {
    appServiceName: ordersAppService.name
    workspaceId: logAnalyticsWorkspace.id
  }
}
// modules/diagnostics.bicep
param appServiceName string
param workspaceId string
// 'scope' needs a resource symbol rather than a resource ID string,
// so the target is referenced here as an existing resource.
resource appService 'Microsoft.Web/sites@2022-09-01' existing = {
  name: appServiceName
}
resource diagnosticSetting 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'send-to-workspace'
  scope: appService
  properties: {
    workspaceId: workspaceId
    // Retention for workspace data is configured on the Log Analytics workspace itself
    logs: [
      { categoryGroup: 'allLogs', enabled: true }
    ]
    metrics: [
      { category: 'AllMetrics', enabled: true }
    ]
  }
}
Enforce this with Azure Policy (for example, a Deploy-Diagnostics initiative) so every new resource automatically ships logs to your workspace.
05 Distributed Tracing
Distributed tracing answers the question: "What happened to this request as it moved across my 12 microservices?" Each service creates a span (a named, timed operation). Spans are linked by a shared trace-id propagated in HTTP headers and message properties. The resulting tree of spans is a trace, showing the full causal chain, timing, and errors.
| Concept | Definition |
|---|---|
| Trace | The entire journey of one request across all services: a tree of spans |
| Span | A single named, timed operation within one service (HTTP handler, DB query, cache lookup) |
| TraceId | 128-bit ID unique to one trace, shared by all spans in that trace |
| SpanId | 64-bit ID unique to one span, used as the parent reference by child spans |
| ParentSpanId | The SpanId of the span that created this one; it builds the tree structure |
| Baggage | Key/value pairs propagated through the entire trace, for cross-cutting context |
| W3C TraceContext | HTTP header standard (traceparent + tracestate) that defines the propagation format |
| OTLP | OpenTelemetry Protocol: standard wire format for exporting traces, metrics, logs |
| Sampling | Decision to record or discard a trace: head-based (at trace start) or tail-based (after the trace completes) |
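The relationship between SpanId and ParentSpanId is easiest to see by rebuilding the tree by hand. The sketch below is plain Python with made-up span data (the IDs, names, and durations are hypothetical). It indexes spans by parent and prints the trace as an indented tree, which is essentially what a trace viewer does.

```python
from collections import defaultdict

# Hypothetical spans from one trace: (span_id, parent_span_id, name, duration_ms)
spans = [
    ("a1", None, "POST /api/v1/orders", 120),
    ("b2", "a1", "ValidateInventory", 45),
    ("c3", "a1", "SQL INSERT orders", 30),
    ("d4", "b2", "GET inventory-api/stock", 40),
]


def build_tree(spans):
    """Index spans by parent so the trace can be walked from the root."""
    children = defaultdict(list)
    root = None
    for span in spans:
        _span_id, parent_id, _name, _duration = span
        if parent_id is None:
            root = span  # the root span has no parent
        else:
            children[parent_id].append(span)
    return root, children


def render(span, children, depth=0):
    """Return indented lines, one per span, parents before children."""
    span_id, _parent, name, duration = span
    lines = [f"{'  ' * depth}{name} ({duration} ms)"]
    for child in children[span_id]:
        lines.extend(render(child, children, depth + 1))
    return lines


root, children = build_tree(spans)
tree_lines = render(root, children)
print("\n".join(tree_lines))
```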
05a OpenTelemetry in .NET 8
OpenTelemetry (OTel) is the vendor-neutral, CNCF-backed standard for instrumentation: it lets you collect traces, metrics, and logs with a single SDK and export to any backend (Azure Monitor, Jaeger, Datadog, Grafana). In .NET 8, OTel builds on native runtime APIs: System.Diagnostics.Activity for tracing and System.Diagnostics.Metrics for metrics. The setup configures resource metadata (service name, version, environment), adds auto-instrumentation for ASP.NET Core, HttpClient, and EF Core, then exports via OTLP, giving you broad observability with minimal code changes.
Full OTel Setup: Tracing + Metrics + Logs
// NuGet packages needed:
// OpenTelemetry.Extensions.Hosting
// OpenTelemetry.Instrumentation.AspNetCore
// OpenTelemetry.Instrumentation.Http
// OpenTelemetry.Instrumentation.EntityFrameworkCore
// OpenTelemetry.Instrumentation.Runtime
// OpenTelemetry.Instrumentation.Process (needed for AddProcessInstrumentation)
// OpenTelemetry.Exporter.OpenTelemetryProtocol
// Azure.Monitor.OpenTelemetry.AspNetCore
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddOpenTelemetry()
// -- Resource metadata (common tags on all signals) --
.ConfigureResource(resource => resource
.AddService(
serviceName: "orders-api",
serviceVersion: "2.1.0",
serviceInstanceId: Environment.MachineName)
.AddAttributes(new Dictionary<string, object>
{
["deployment.environment"] = builder.Environment.EnvironmentName,
["service.namespace"] = "com.myplatform",
["k8s.pod.name"] = Environment.GetEnvironmentVariable("HOSTNAME") ?? "",
}))
// -- Tracing --
.WithTracing(tracing => tracing
.AddAspNetCoreInstrumentation(options =>
{
options.RecordException = true;
options.EnrichWithHttpRequest = (activity, request) =>
{
activity.SetTag("http.request.body_size",
request.ContentLength ?? 0);
};
options.EnrichWithHttpResponse = (activity, response) =>
{
activity.SetTag("http.response.body_size",
response.ContentLength ?? 0);
};
// Don't trace health checks: too noisy
options.Filter = ctx =>
!ctx.Request.Path.StartsWithSegments("/health");
})
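.AddHttpClientInstrumentation(options =>
{
options.RecordException = true;
// Request headers are not recorded as span tags by default, so there is
// nothing to redact here. Do not modify the request in this callback:
// removing "Authorization" would strip it from the real outgoing call.
options.EnrichWithHttpRequestMessage = (activity, request) =>
{
// Illustrative tag name (an assumption, not a semantic convention)
activity.SetTag("http.request.has_authorization",
request.Headers.Authorization is not null);
};
})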
.AddEntityFrameworkCoreInstrumentation(options =>
{
options.SetDbStatementForText = true; // Capture SQL
options.SetDbStatementForStoredProcedure = true;
})
.AddSource("Orders.Api") // Custom ActivitySource
.AddSource("MassTransit") // MassTransit instrumentation
.AddOtlpExporter(options =>
{
options.Endpoint = new Uri(
builder.Configuration["Otel:Endpoint"] ?? "http://localhost:4317");
options.Protocol = OtlpExportProtocol.Grpc;
}))
// -- Metrics --
.WithMetrics(metrics => metrics
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddRuntimeInstrumentation()
.AddProcessInstrumentation()
.AddMeter("Orders.Api")
.AddOtlpExporter())
// -- Logging --
.WithLogging(logging => logging
.AddOtlpExporter());
// Also export to Azure Monitor (Application Insights) simultaneously
builder.Services.AddOpenTelemetry().UseAzureMonitor();
Custom ActivitySource: Manual Spans
// Define ActivitySource (one per service or logical domain)
public static class OrdersTracing
{
public static readonly ActivitySource Source =
new("Orders.Api", "2.1.0");
}
// Create manual spans for business operations
public async Task<Result<OrderResponse>> Handle(
CreateOrderCommand cmd, CancellationToken ct)
{
// Start a new span: child of the current ambient activity
using var activity = OrdersTracing.Source.StartActivity(
"CreateOrder",
ActivityKind.Internal);
// Tag the span with business context
activity?.SetTag("order.customer_id", cmd.CustomerId.ToString());
activity?.SetTag("order.region", cmd.Region);
activity?.SetTag("order.item_count", cmd.LineItems.Count);
activity?.SetTag("order.channel", cmd.Channel);
try
{
// Nested span for inventory validation
using var inventorySpan = OrdersTracing.Source.StartActivity(
"ValidateInventory",
ActivityKind.Client);
inventorySpan?.SetTag("inventory.product_count", cmd.LineItems.Count);
var inventory = await _inventoryClient.ValidateAsync(cmd.LineItems, ct);
inventorySpan?.SetTag("inventory.all_available", inventory.AllAvailable);
if (!inventory.AllAvailable)
{
activity?.SetStatus(ActivityStatusCode.Error, "Insufficient inventory");
return Result.Failure<OrderResponse>("Some items are out of stock");
}
var order = Order.Create(cmd.CustomerId, cmd.LineItems);
activity?.SetTag("order.id", order.Id.ToString());
// Add an event (point-in-time annotation on the span)
activity?.AddEvent(new ActivityEvent("OrderPersisted", tags:
new ActivityTagsCollection { ["order.id"] = order.Id.ToString() }));
activity?.SetStatus(ActivityStatusCode.Ok);
return Result.Success(OrderResponse.From(order));
}
catch (Exception ex)
{
// Record exception on span
activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
activity?.RecordException(ex);
throw;
}
}
05b Context Propagation
For a distributed trace to work, the trace context must travel with the request across every hop: HTTP calls, queue messages, gRPC calls. The W3C TraceContext standard defines how this works via HTTP headers. OTel handles this automatically for instrumented libraries.
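To make the propagation step concrete, here is a small sketch in Python (standard library only) of what an instrumented library does on each hop: parse the incoming `traceparent` header, keep the trace ID, and send a new span ID downstream. The function names are hypothetical, and the parser is deliberately simplified. It skips some rules of the W3C specification, such as the handling of unknown versions.

```python
import re
import secrets

# version - 32 hex trace-id - 16 hex parent span-id - 2 hex flags
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Split a traceparent header into its fields, or return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero trace or span IDs are invalid
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_span_id": span_id,
        "sampled": bool(int(flags, 16) & 0x01),  # lowest bit is the sampled flag
    }


def child_traceparent(parent):
    """Build the header sent downstream: same trace ID, new span ID, same sampled flag."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    flags = "01" if parent["sampled"] else "00"
    return f"{parent['version']}-{parent['trace_id']}-{new_span_id}-{flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(incoming)
print(incoming)
print(outgoing)
```

The trace ID stays constant across hops while the span ID changes at every hop. That is what lets a backend stitch spans from different services into one tree.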
// W3C traceparent header format
traceparent: 00-<trace-id>-<parent-span-id>-<flags>
// Example:
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
// ^ ^ ^ ^
// version 128-bit trace ID 64-bit span ID flags (01=sampled)
// W3C tracestate: vendor-specific data
tracestate: az=<app-insights-specific-data>
// -- HttpClient propagates context automatically with OTel --
// Nothing to do: OTel HttpClient instrumentation injects traceparent
// -- Manual propagation for Service Bus messages --
// Sender: inject current trace context into message properties
public async Task SendOrderEventAsync(OrderCreatedEvent evt)
{
var message = new ServiceBusMessage(JsonSerializer.SerializeToUtf8Bytes(evt))
{
MessageId = evt.OrderId.ToString(),
ContentType = "application/json",
};
// Inject W3C trace context into message ApplicationProperties
var propagator = Propagators.DefaultTextMapPropagator;
propagator.Inject(
new PropagationContext(Activity.Current?.Context ?? default, Baggage.Current),
message.ApplicationProperties,
(props, key, value) => props[key] = value);
await _sender.SendMessageAsync(message);
}
// Receiver: extract trace context from message and set as parent
public async Task ProcessMessageAsync(ServiceBusReceivedMessage message)
{
// Extract context from message properties
var propagator = Propagators.DefaultTextMapPropagator;
var parentContext = propagator.Extract(
default,
message.ApplicationProperties,
(props, key) =>
{
if (props.TryGetValue(key, out var value))
return [value?.ToString() ?? ""];
return [];
});
// Start child span linked to the sender's trace
using var activity = OrdersTracing.Source.StartActivity(
"ServiceBus.ProcessOrder",
ActivityKind.Consumer,
parentContext.ActivityContext);
activity?.SetTag("messaging.system", "servicebus");
activity?.SetTag("messaging.destination", "orders");
activity?.SetTag("messaging.message_id", message.MessageId);
await ProcessOrderAsync(message.Body.ToString());
}
Baggage: Cross-Service Context
// Set baggage at entry point (API Gateway / BFF)
Baggage.Current = Baggage.SetBaggage("tenant.id", request.TenantId);
Baggage.Current = Baggage.SetBaggage("user.id", user.GetUserId());
Baggage.Current = Baggage.SetBaggage("feature.flags","beta-checkout=true");
// Read baggage anywhere in the call chain (propagated automatically)
var tenantId = Baggage.GetBaggage("tenant.id");
var featureFlags = Baggage.GetBaggage("feature.flags");
// Baggage travels with the trace context but is not copied onto spans automatically;
// add the values as span tags where you want to filter on them.
// Baggage is sent in clear text to downstream services, so never put secrets in it.
// Useful for: multi-tenant filtering, feature flag tracking,
// A/B test attribution, compliance tagging
05c Trace Sampling Strategies
At scale, recording every trace is prohibitively expensive: a service handling 10,000 requests per second can generate terabytes of trace data daily. Sampling strategies keep costs manageable while preserving visibility into errors and slow requests. Head-based sampling decides at trace start whether to record. It is simple, but it cannot know whether the request will later fail or be slow, so it may miss rare errors. Tail-based sampling buffers complete traces and keeps only the interesting ones (errors, slow requests), which is more powerful but requires additional infrastructure. A common production strategy combines ratio-based sampling for normal traffic with always-on sampling for critical paths like payments and checkout.
| Strategy | Decision Point | Pros | Cons |
|---|---|---|---|
| AlwaysOn | Head (trace start) | See everything | Huge data volume and cost at scale |
| AlwaysOff | Head | Zero cost | No visibility at all; dev only |
| TraceIdRatio (N%) | Head | Predictable volume, simple config | May miss infrequent errors |
| ParentBased | Head | Respects upstream sampling decision | Downstream services cannot override the upstream decision |
| Tail-Based (OTel Collector) | Tail (after trace) | Can always keep error and slow traces | Requires full trace buffering, more infra |
| Adaptive (App Insights) | Head (dynamic) | Auto-adjusts to traffic volume | Proprietary; App Insights only |
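The two most common head-based strategies in the table can be sketched in a few lines of Python (standard library only). This illustrates the idea and is not the exact algorithm of any particular SDK: the function names are hypothetical, and the choice of the last 8 bytes of the trace ID is an assumption. The point is that the decision is a pure function of the trace ID, so every service that sees the same trace reaches the same verdict.

```python
import hashlib


def ratio_sample(trace_id_hex, ratio):
    """Deterministic head sampling: the same trace ID always yields the same decision."""
    # Treat the last 8 bytes (16 hex characters) of the trace ID as a number in [0, 2**64)
    value = int(trace_id_hex[-16:], 16)
    return value < ratio * 2**64


def parent_based_sample(parent_sampled, trace_id_hex, ratio):
    """Follow the upstream decision when there is one; otherwise decide by ratio."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_sample(trace_id_hex, ratio)


# Hypothetical trace IDs: 10,000 pseudo-random 128-bit values
trace_ids = [hashlib.sha256(str(i).encode()).hexdigest()[:32] for i in range(10_000)]
kept = sum(ratio_sample(t, 0.10) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

With a ratio of 0.10, roughly one trace in ten is kept. Because the decision depends only on the trace ID, a trace is either recorded in full or dropped in full rather than showing up with missing spans.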
// -- Head-based sampling via OTel SDK --
// A head sampler decides when the span starts, before any error can occur,
// so it cannot "always sample errors"; keeping every error trace needs tail-based sampling.
// Here: follow the parent's decision, otherwise sample 10% of new traces.
builder.Services.AddOpenTelemetry()
    .WithTracing(t => t
        .SetSampler(new ParentBasedSampler(
            new TraceIdRatioBasedSampler(0.10))));
// Custom head sampler: always sample critical paths, a fixed fraction of everything else.
// Duration is unknown at span start, so sampling "slow requests" needs tail-based sampling.
public class CriticalPathSampler : Sampler
{
    private readonly double _defaultRatio;
    public CriticalPathSampler(double defaultRatio = 0.05) => _defaultRatio = defaultRatio;

    public override SamplingResult ShouldSample(in SamplingParameters parameters)
    {
        // Always sample if parent says so (preserve parent decision)
        if (parameters.ParentContext.TraceFlags.HasFlag(ActivityTraceFlags.Recorded))
            return new SamplingResult(SamplingDecision.RecordAndSample);
        // Sample based on URL priority. Tags are only present if the instrumentation
        // supplies them at span creation; the key depends on its version
        // ("url.path" in newer HTTP semantic conventions, "http.target" in older ones).
        var httpPath = parameters.Tags?
            .FirstOrDefault(t => t.Key is "url.path" or "http.target").Value?.ToString();
        // Always trace payment and checkout flows
        if (httpPath?.Contains("/checkout") == true ||
            httpPath?.Contains("/payments") == true)
            return new SamplingResult(SamplingDecision.RecordAndSample);
        // Sample a fixed fraction (5% by default) of everything else
        return new SamplingResult(
            Random.Shared.NextDouble() < _defaultRatio
                ? SamplingDecision.RecordAndSample
                : SamplingDecision.Drop);
    }
}
05d Jaeger & Trace Backends
Your trace data needs a backend to store, index, and visualize it, and you have several options depending on your environment and budget. Azure Monitor (Application Insights) is the native choice for Azure workloads with zero infrastructure to manage, while Jaeger and Grafana Tempo are solid open-source alternatives for local development, staging environments, or multi-cloud setups. A key benefit of OpenTelemetry is that you can switch backends by changing exporter configuration alone; your instrumentation code stays the same. Use Jaeger locally via Docker for instant trace visualization during development, then export to Azure Monitor in production.
| Backend | Type | Best For | Azure Integration |
|---|---|---|---|
| Azure Monitor (App Insights) | Managed SaaS | Full Azure-native stack, E2E view | Native: UseAzureMonitor() |
| Jaeger | Open source | Dev/staging, self-hosted tracing | Deploy on AKS, OTLP ingest |
| Grafana Tempo | Open source | Cost-effective, Grafana-native | Deploy on AKS, query in Grafana |
| Zipkin | Open source | Legacy, simple setup | OTLP export via OTel collector |
| Datadog APM | Commercial SaaS | Cross-cloud, ML anomaly detection | OTLP export to Datadog agent |
| Honeycomb | Commercial SaaS | High-cardinality event analytics | OTLP export |
Local Dev with Jaeger via Docker
# docker-compose.override.yml (dev environment)
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # Jaeger UI
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    networks: [platform]
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4319:4317" # Receive from apps
    networks: [platform]
# otel-collector-config.yaml
# (recent Collector releases no longer ship the dedicated "jaeger" exporter;
#  Jaeger accepts OTLP directly, so export to it with the otlp exporter)
# receivers:
#   otlp:
#     protocols: { grpc: { endpoint: 0.0.0.0:4317 } }
# exporters:
#   otlp/jaeger:
#     endpoint: jaeger:4317
#     tls: { insecure: true }
#   azuremonitor:
#     connection_string: "${AI_CONNECTION_STRING}"
# service:
#   pipelines:
#     traces:
#       receivers: [otlp]
#       exporters: [otlp/jaeger, azuremonitor]
// appsettings.Development.json: point to local Jaeger
{
"Otel": {
"Endpoint": "http://localhost:4317"
},
"ApplicationInsights": {
"ConnectionString": ""
}
}
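// appsettings.Production.json (Azure Monitor)
// Otel:Endpoint must be an OTLP receiver such as an OpenTelemetry Collector;
// the Application Insights ingestion endpoint from a connection string is not an OTLP receiver.
// Key Vault references (@Microsoft.KeyVault(...)) are resolved in App Service / Functions
// app settings, not inside appsettings.json, so supply the connection string as the
// app setting APPLICATIONINSIGHTS_CONNECTION_STRING, for example:
// @Microsoft.KeyVault(VaultName=myVault;SecretName=ai-conn-string)
{
"Otel": {
"Endpoint": "<otlp-collector-endpoint>"
},
"ApplicationInsights": {
"ConnectionString": ""
}
}
06 SLIs, SLOs & Error Budgets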
SLOs (Service Level Objectives) are the bridge between engineering and business. They define what "good enough" looks like for your service in terms of measurable user experience.
| Term | Definition | Example |
|---|---|---|
| SLI (Indicator) | The metric you measure | P99 latency of /api/orders POST |
| SLO (Objective) | The target level for the SLI | 99% of requests complete in under 500 ms over 30 days (P99 latency < 500 ms) |
| SLA (Agreement) | Contract with users/customers; a legal commitment | 99.9% uptime per month |
| Error Budget | The failure allowance implied by the SLO (100% minus the target), which you can spend on incidents and deployments | 0.5% of requests allowed to fail in 30 days (for a 99.5% SLO) |
| Burn Rate | How fast you're consuming error budget | 2x burn rate = budget gone in 15 days not 30 |
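The arithmetic behind the last two rows is worth working through once. The sketch below is plain Python with made-up traffic numbers, and the function names are hypothetical. It computes the error budget for a 99.5% SLO and shows why a 2x burn rate exhausts a 30-day budget in 15 days.

```python
def error_budget(total_requests, slo_target_pct):
    """Number of requests allowed to fail in the window while still meeting the SLO."""
    return total_requests * (100.0 - slo_target_pct) / 100.0


def burn_rate(failed, total, slo_target_pct):
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget lasts exactly the SLO window; 2.0 means it lasts half as long.
    """
    allowed_error_rate = (100.0 - slo_target_pct) / 100.0
    return (failed / total) / allowed_error_rate


def days_until_exhausted(rate, window_days=30):
    """How long a full budget lasts if the current burn rate continues."""
    return window_days / rate


# Hypothetical traffic: 10 million requests in 30 days, 99.5% availability SLO
budget = error_budget(total_requests=10_000_000, slo_target_pct=99.5)

# In the last hour: 1,000 failures out of 100,000 requests (1% error rate)
rate = burn_rate(failed=1_000, total=100_000, slo_target_pct=99.5)

print(f"budget: {budget:.0f} failed requests")
print(f"burn rate: {rate:.1f}x, budget lasts {days_until_exhausted(rate):.0f} days")
```

A 1% error rate against a 0.5% allowance is a burn rate of 2, which matches the "budget gone in 15 days" example in the table.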
SLO Queries in KQL
// -- Availability SLO: % of successful requests --
let sloWindow = 30d;
let sloTarget = 99.5; // 99.5% availability
requests
| where timestamp > ago(sloWindow)
| where name !startswith "GET /health" // Exclude health checks from SLO
| summarize
Total = count(),
Success = countif(success == true)
| extend
AvailabilitySLI = round(100.0 * Success / Total, 3),
ErrorBudget_pct = 100.0 - sloTarget,
ActualErrors = Total - Success,
BudgetedErrors = round(Total * (1 - sloTarget / 100.0), 0),
BudgetRemaining = round(Total * (1 - sloTarget / 100.0), 0) - (Total - Success)
| extend SLO_Met = AvailabilitySLI >= sloTarget
// -- Latency SLO: P99 must be under 500ms --
requests
| where timestamp > ago(30d)
| where name == "POST /api/v1/orders"
| summarize
TotalRequests = count(),
WithinTarget = countif(duration < 500),
P50_ms = percentile(duration, 50),
P95_ms = percentile(duration, 95),
P99_ms = percentile(duration, 99)
| extend
SLI_latency = round(100.0 * WithinTarget / TotalRequests, 3),
SLO_target = 99.0,
SLO_Met = (100.0 * WithinTarget / TotalRequests) >= 99.0
// -- Error budget burn rate alert --
// Alert when burning through budget more than 2x faster than sustainable
let budget_pct = 0.5; // 0.5% of requests can fail
let burn_window = 1h;
let total_window = 30d;
let totalRequests = toscalar(requests | where timestamp > ago(total_window) | count);
let budgetTotal = totalRequests * (budget_pct / 100.0);
let recentErrors = toscalar(
requests | where timestamp > ago(burn_window) | where success == false | count);
let sustainableBurnPerHour = budgetTotal / (30.0 * 24.0);
// A query must end in a tabular expression: return one row only when burning 2x too fast
print RecentErrors = recentErrors, SustainablePerHour = sustainableBurnPerHour
| where RecentErrors > SustainablePerHour * 2
07 Alerting Patterns & Runbooks
Good alerting is the difference between catching incidents in minutes rather than hours, but bad alerting creates noise that trains your team to ignore pages. The golden rule is to alert on user-visible symptoms (error rate, latency, availability) rather than internal causes (CPU, memory), and every alert must have a clear runbook explaining what to do when it fires. Combine multi-window burn rate alerts for SLO-based detection with deadman's switches for silence detection, and tier severity levels so that Sev 1 pages on-call immediately while Sev 3 creates a ticket for the next business day.
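The two patterns named above reduce to very small decision rules. The sketch below is plain Python with hypothetical function names. The 14.4 threshold comes from commonly cited SRE guidance for a 1-hour / 5-minute window pair on a 30-day SLO and is used here only as an illustrative value. Tune it to your own budget policy.

```python
def should_page(burn_long, burn_short, threshold):
    """Multi-window rule: page only if both the long and the short window exceed the threshold.

    The long window shows the problem is significant; the short window shows it is
    still happening, so the alert clears soon after recovery.
    """
    return burn_long >= threshold and burn_short >= threshold


def deadman_triggered(last_heartbeat_ts, now_ts, max_silence_seconds):
    """Deadman's switch: fire when an expected signal has been silent for too long."""
    return (now_ts - last_heartbeat_ts) > max_silence_seconds


# Ongoing incident: both windows are burning fast, so page
print(should_page(burn_long=16.0, burn_short=15.0, threshold=14.4))

# Incident already recovered: the 1h window is still high, the 5m window is back to normal
print(should_page(burn_long=16.0, burn_short=0.5, threshold=14.4))

# A job that should report every 5 minutes has been silent for 10 minutes
print(deadman_triggered(last_heartbeat_ts=0, now_ts=601, max_silence_seconds=600))
```

Requiring both windows is what keeps this from paging on a brief blip or continuing to page long after the problem is fixed. The deadman's switch covers the opposite failure, where the telemetry itself stops arriving and every other alert falls silent.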
Runbook Template
## Alert: High Error Rate (Orders API)
**Severity:** Sev 1
**SLO Impact:** Yes, consuming error budget at >2x sustainable rate
### What this means
More than 5% of POST /api/v1/orders requests are failing over the last 5 minutes.
Users cannot place orders.
### Immediate Diagnosis (< 5 minutes)
1. Check Application Map in App Insights: which dependency is failing?
Link: https://portal.azure.com/#resource/<ai-resource-id>/applicationMap
2. Run this query in Log Analytics:
```kql
exceptions
| where timestamp > ago(30m)
| where cloud_RoleName == "orders-api"
| summarize count() by outerMessage
| order by count_ desc
```
3. Check recent deployments:
Link: https://dev.azure.com/<org>/<project>/_release
### Common Causes & Fixes
| Symptom | Cause | Fix |
|---------|-------|-----|
| SQL timeout errors | DB CPU spike | Scale up SQL, check missing indexes |
| 503 from inventory-api | Inventory service down | Check inventory pod status in AKS |
| 401 from payment-api | Token expiry / MI issue | Restart pod, check MI role assignments |
| OutOfMemory | Memory leak in new deploy | Roll back deployment |
### Escalation
- Engineering on-call: PagerDuty → orders-api rotation
- Incident commander: Slack #incidents
08 Comparison & Decision Tables
With multiple overlapping services in the Azure monitoring stack, knowing which tool to reach for in each situation saves valuable time during incidents and planning. Application Insights is your APM layer for request-level performance, Log Analytics is your query engine for cross-service log correlation, Azure Monitor Metrics gives you near-real-time numeric dashboards, and distributed tracing shows the full request path across microservices. Use these decision tables as a quick reference when you need to answer a specific observability question.
When to Use Each Service
| Question | Answer / Service |
|---|---|
| I want to see all HTTP requests and response times for my API | Application Insights → Requests table |
| I want to see the SQL queries my app is running and how slow they are | Application Insights → Dependencies table |
| I want to search across all service logs including platform events | Log Analytics → KQL across all tables |
| I want a chart of CPU and memory over time | Azure Monitor Metrics → metrics explorer |
| I want to be paged when error rate exceeds 5% | Azure Monitor Alert Rule (log search or metric) |
| I want to see the full path of one slow request across 5 microservices | Distributed Tracing → App Insights E2E or Jaeger |
| I want to track a custom business event like 'OrderPlaced' | Application Insights → TrackEvent / customEvents |
| I want to query Key Vault access audit logs | Log Analytics → AzureDiagnostics where ResourceType == "VAULTS" |
| I want a dashboard showing SLOs for 5 services | Azure Monitor Workbook or Grafana |
| I want to detect anomalous exception spikes automatically | Application Insights Smart Detection |
| I want Prometheus metrics for my K8s workloads | Azure Monitor Metrics + Container Insights + Prometheus scraping |
Logging Levels β What to Log at Each Level
| Level | ILogger Method | Use For | Production Volume |
|---|---|---|---|
| Critical | LogCritical | App cannot continue: unrecoverable failure | Rare; page immediately |
| Error | LogError | Operation failed: exception, 5xx, data corruption | Low; alert on any |
| Warning | LogWarning | Unexpected but handled: retry, DLQ, validation failure | Medium; watch the trend |
| Information | LogInformation | Normal business flow: request received, order created | High; sample in prod |
| Debug | LogDebug | Diagnostic detail: cache hit/miss, intermediate values | None in prod |
| Trace | LogTrace | Very verbose: method entry/exit, every loop iteration | Never in prod |
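The "sample in prod" guidance for Information-level logs can be sketched with Python's standard `logging` module, whose levels map only loosely onto the ILogger levels above. The class names and the keep-every-Nth rule are assumptions for illustration. Production pipelines more often sample by trace ID so that all logs for a sampled request survive together.

```python
import logging


class InfoSamplingFilter(logging.Filter):
    """Keep every record at WARNING or above; keep only every Nth INFO record."""

    def __init__(self, keep_every):
        super().__init__()
        self.keep_every = keep_every
        self._seen = 0

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings, errors, or criticals
        if record.levelno < logging.INFO:
            return False  # debug-level detail stays out of production output
        self._seen += 1
        return (self._seen - 1) % self.keep_every == 0


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the effect is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


logger = logging.getLogger("orders-demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = ListHandler()
handler.addFilter(InfoSamplingFilter(keep_every=10))
logger.addHandler(handler)

for i in range(20):
    logger.info("order %d created", i)
logger.debug("cache miss")
logger.warning("retrying payment call")
logger.error("order 7 failed")

print("\n".join(handler.messages))
```

Twenty Information records shrink to two, while the warning and the error always get through. That is the volume profile the table recommends.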
09 Quick Reference Cheat Sheet
This cheat sheet distills the most commonly used KQL patterns and OpenTelemetry setup snippets into copy-paste-ready blocks for daily use. Keep these patterns handy during incident response when you need to quickly query error rates, latency percentiles, or trace a specific operation across services. The NuGet package reference at the bottom lists which packages to install for each instrumentation scenario, from basic ASP.NET Core auto-instrumentation to full OTel with Prometheus metrics export.
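// Always start with a time filter (TimeGenerated in Log Analytics tables; timestamp in App Insights tables such as requests)
| where TimeGenerated > ago(1h)
// Error rate (requests table, so bin on timestamp)
| summarize ErrorRate = countif(success==false)*100.0/count() by bin(timestamp,5m)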
// Latency percentiles
| summarize P50=percentile(duration,50), P95=percentile(duration,95), P99=percentile(duration,99) by name
// Top N
| top 20 by Count desc
// Count distinct
| summarize UniqueUsers = dcount(user_Id)
// Dynamic property
| extend Region = tostring(customDimensions["Region"])
// Time render
| render timechart
// Cross AI resource query
app("MyOtherAppInsights").requests | where timestamp > ago(1h)
// Parse structured log message
| parse Message with "OrderId=" OrderId " status=" Status
// Minimal OTel setup for a microservice
builder.Services.AddOpenTelemetry()
.ConfigureResource(r => r.AddService("my-service", "1.0.0"))
.WithTracing(t => t
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddEntityFrameworkCoreInstrumentation()
.AddSource("MyService")
.AddOtlpExporter())
.WithMetrics(m => m
.AddAspNetCoreInstrumentation()
.AddMeter("MyService")
.AddOtlpExporter())
.UseAzureMonitor(); // Also send to App Insights
// Custom span
using var activity = MySource.StartActivity("DoSomething");
activity?.SetTag("key", "value");
activity?.AddEvent(new ActivityEvent("StepCompleted"));
// Custom metric
MyCounter.Add(1, new KeyValuePair<string,object?>("region","EU"));
| NuGet Package | Purpose |
|---|---|
| Microsoft.ApplicationInsights.AspNetCore | Classic App Insights SDK for ASP.NET Core |
| Azure.Monitor.OpenTelemetry.AspNetCore | Modern OTel-based Azure Monitor exporter (recommended) |
| OpenTelemetry.Extensions.Hosting | OTel host builder extensions |
| OpenTelemetry.Instrumentation.AspNetCore | Auto-instrument HTTP requests |
| OpenTelemetry.Instrumentation.Http | Auto-instrument HttpClient calls |
| OpenTelemetry.Instrumentation.EntityFrameworkCore | Auto-instrument EF Core SQL queries |
| OpenTelemetry.Instrumentation.Runtime | GC, thread pool, heap metrics |
| OpenTelemetry.Exporter.OpenTelemetryProtocol | OTLP exporter (Jaeger, Tempo, Collector) |
| OpenTelemetry.Exporter.Prometheus.AspNetCore | /metrics endpoint for Prometheus scraping |
| Serilog.AspNetCore + Serilog.Sinks.ApplicationInsights | Structured logging to App Insights via Serilog |
| Microsoft.ApplicationInsights.WorkerService | App Insights for background services and workers |