SLI / SLO / SLA Definition & Alerting

Building Reliable Integration Services

Introduction

SLI (Service Level Indicator), SLO (Service Level Objective), and SLA (Service Level Agreement) are the three building blocks of reliability measurement for integration services. SLIs are the metrics you measure, SLOs are the internal targets you set for those metrics, and SLAs are the commitments you make to customers. Defining all three carefully for Azure integration workloads helps you measure what matters and meet your reliability commitments.

This guide covers:

- SLI definitions: choosing meaningful metrics
- SLO setting: establishing achievable targets
- SLA commitments: customer-facing agreements
- Alerting strategy: when to page teams
- Error budgets: managing reliability vs. velocity

Understanding the Concepts

SLI, SLO, SLA Relationship

┌─────────────────────────────────────────────────────────────────────┐
│                  SLI / SLO / SLA RELATIONSHIP                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   SLI (Service Level Indicator)                                     │
│   ─────────────────────────────────────                             │
│   What you measure:                                                 │
│   - Request latency                                                 │
│   - Error rate                                                      │
│   - Throughput (requests/second)                                    │
│   - Availability (% of successful requests)                         │
│   - Queue depth                                                     │
│                                                                     │
│   SLO (Service Level Objective)                                     │
│   ───────────────────────────                                       │
│   Target you set (internal):                                        │
│   - 99.9% of requests succeed                                       │
│   - p99 latency < 500ms                                             │
│   - 99.95% availability                                             │
│   - < 1000 messages queued                                          │
│                                                                     │
│   SLA (Service Level Agreement)                                     │
│   ─────────────────────────────                                     │
│   Commitment to customers (external):                               │
│   - 99.5% availability (with credits if missed)                     │
│   - Response time < 2 seconds                                       │
│   - 99% of messages processed within 5 minutes                      │
│                                                                     │
│   ───────────────────────────────────────────────────               │
│   SLO should be stricter than SLA (SLAs have contractual            │
│   consequences, SLOs trigger internal action)                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
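
To see why the internal target should sit above the customer commitment, it can help to turn the percentages in the diagram into allowed downtime. The sketch below is illustrative only: it assumes availability is measured by time rather than by request count, and `DowntimeBudget` and `AllowedDowntime` are hypothetical names, not part of any Azure SDK.

```csharp
using System;

public static class DowntimeBudget
{
    // Downtime that a time-based availability target permits over a window.
    // availabilityTarget is a fraction, for example 0.999 for 99.9%.
    public static TimeSpan AllowedDowntime(double availabilityTarget, TimeSpan window)
    {
        if (availabilityTarget < 0 || availabilityTarget > 1)
            throw new ArgumentOutOfRangeException(nameof(availabilityTarget));

        return TimeSpan.FromSeconds(window.TotalSeconds * (1 - availabilityTarget));
    }
}
```

Under these assumptions, `DowntimeBudget.AllowedDowntime(0.9995, TimeSpan.FromDays(30))` comes to about 21.6 minutes for the 99.95% SLO, while the 99.5% SLA allows about 3.6 hours over the same 30 days. The difference is the buffer the team has before credits are owed.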

Integration Service SLIs

┌─────────────────────────────────────────────────────────────────────┐
│              INTEGRATION SERVICE SLIS                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   AZURE FUNCTIONS                                                   │
│   ✓ Execution success rate                                          │
│   ✓ Execution duration (p50, p95, p99)                              │
│   ✓ Function invocation rate                                        │
│   ✓ Cold start time                                                 │
│   ✓ Memory utilization                                              │
│                                                                     │
│   SERVICE BUS                                                       │
│   ✓ Message throughput (in/out)                                     │
│   ✓ Queue depth (active messages)                                   │
│   ✓ Message processing latency                                      │
│   ✓ Dead letter queue size                                          │
│   ✓ Connection success rate                                         │
│                                                                     │
│   LOGIC APPS                                                        │
│   ✓ Run success rate                                                │
│   ✓ Run duration                                                    │
│   ✓ Trigger frequency                                               │
│   ✓ Action failure rate                                             │
│   ✓ Workflow queue depth                                            │
│                                                                     │
│   API MANAGEMENT                                                    │
│   ✓ Request success rate (backend)                                  │
│   ✓ Request latency (p95, p99)                                      │
│   ✓ Backend latency                                                 │
│   ✓ Policy violation rate                                           │
│   ✓ Subscription quota usage                                        │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
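
Many of the indicators listed above can be expressed as a ratio of good events to valid events, which keeps them directly comparable with a percentage target. A minimal sketch, assuming you already have raw counts or durations exported from your telemetry; `RatioSli`, `GoodRatio`, and `FractionWithin` are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RatioSli
{
    // Good events divided by valid events. Returns 1.0 when there was no
    // traffic, so an idle period does not count against the service.
    public static double GoodRatio(long good, long valid)
    {
        if (good < 0 || valid < 0 || good > valid)
            throw new ArgumentException("Expected 0 <= good <= valid.");

        return valid == 0 ? 1.0 : (double)good / valid;
    }

    // Latency as a ratio SLI: the share of operations that finished
    // within the threshold.
    public static double FractionWithin(
        IReadOnlyCollection<TimeSpan> durations, TimeSpan threshold)
    {
        long good = durations.Count(d => d <= threshold);
        return GoodRatio(good, durations.Count);
    }
}
```

For example, `RatioSli.GoodRatio(999, 1000)` gives 0.999, and four messages processed in 10, 30, 45, and 90 seconds give a `FractionWithin` of 0.75 against a 60-second threshold.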

SLO Definitions

Setting Meaningful SLOs

{
  "integrationSLOs": {
    "criticalPath": {
      "service": "Order Processing API",
      "slis": [
        {
          "name": "success_rate",
          "description": "Percentage of successful requests",
          "SLI": "ratio(successful_requests / total_requests)",
          "SLO": "99.9% over 30 days",
          "currentBaseline": "99.7%"
        },
        {
          "name": "latency_p99",
          "description": "99th percentile latency",
          "SLI": "histogram_quantile(0.99, latency_bucket)",
          "SLO": "< 500ms over 30 days",
          "currentBaseline": "320ms"
        }
      ]
    },
    "messageProcessing": {
      "service": "Order Message Processor",
      "slis": [
        {
          "name": "processing_success",
          "description": "Successfully processed messages",
          "SLI": "ratio(processed_successfully / total_received)",
          "SLO": "99.95% over 7 days",
          "currentBaseline": "99.92%"
        },
        {
          "name": "processing_latency",
          "description": "End-to-end processing time",
          "SLI": "p95(time_received_to_completed)",
          "SLO": "< 60 seconds over 7 days",
          "currentBaseline": "25 seconds"
        },
        {
          "name": "queue_depth",
          "description": "Messages waiting to be processed",
          "SLI": "max(active_messages)",
          "SLO": "< 1000 for 99% of time",
          "currentBaseline": "450"
        }
      ]
    },
    "dataIntegration": {
      "service": "Customer Data Sync",
      "slis": [
        {
          "name": "sync_success",
          "description": "Successful sync operations",
          "SLI": "ratio(successful_syncs / total_syncs)",
          "SLO": "99.7% over 24 hours",
          "currentBaseline": "99.5%"
        },
        {
          "name": "sync_latency",
          "description": "Time from change to sync",
          "SLI": "p95(time_change_to_synced)",
          "SLO": "< 5 minutes over 24 hours",
          "currentBaseline": "2 minutes"
        }
      ]
    }
  }
}
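
One way to judge whether a target in this configuration is achievable is to compare the error rate implied by `currentBaseline` with the error rate the SLO allows. The following sketch assumes both values are success fractions; `SloGap.RequiredErrorReduction` is a hypothetical helper.

```csharp
using System;

public static class SloGap
{
    // Factor by which the error rate must shrink to move from the baseline
    // to the target. Both arguments are success fractions (0.997 for 99.7%).
    // A result of 1.0 or less means the baseline already meets the target.
    public static double RequiredErrorReduction(double baseline, double target)
    {
        if (baseline < 0 || baseline > 1)
            throw new ArgumentOutOfRangeException(nameof(baseline));
        if (target < 0 || target >= 1)
            throw new ArgumentOutOfRangeException(nameof(target), "Target must be below 100%.");

        return (1 - baseline) / (1 - target);
    }
}
```

For the Order Processing API, moving from a 99.7% baseline to the 99.9% SLO means cutting errors by a factor of about 3. For the message processor, going from 99.92% to 99.95% is a factor of about 1.6.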

Error Budget

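using System;

// Time range an SLO is evaluated over (assumed shape; adapt to your own type).
public class TimeWindow
{
    public DateTime Start { get; set; }
    public TimeSpan Duration { get; set; }
    public DateTime End => Start + Duration;
}

// The three Get* methods are left abstract: they stand in for queries against
// your metrics backend and are not part of any Azure SDK.
public abstract class ErrorBudgetCalculator
{
    public class ErrorBudgetStatus
    {
        public string SLO { get; set; }
        public double ErrorBudgetRemaining { get; set; } // errors still allowed (negative once overspent)
        public double ErrorBudgetUsed { get; set; }      // errors observed in the window
        public double BurnRate { get; set; }             // 1.0 = spending the budget exactly on pace
        public TimeSpan TimeRemaining { get; set; }
    }

    protected abstract long GetTotalRequests(string sloName, TimeWindow window);
    protected abstract long GetErrorCount(string sloName, TimeWindow window);
    protected abstract double GetSLOThreshold(string sloName); // fraction, e.g. 0.999

    public ErrorBudgetStatus CalculateStatus(
        string sloName,
        TimeWindow window)
    {
        var totalRequests = GetTotalRequests(sloName, window);
        var allowedErrors = totalRequests * (1 - GetSLOThreshold(sloName));
        var actualErrors = GetErrorCount(sloName, window);

        var remaining = allowedErrors - actualErrors;
        var used = actualErrors;

        // Burn rate: observed error rate divided by the error rate the SLO
        // allows. Both rates share the same request count, so this reduces
        // to actualErrors / allowedErrors.
        var burnRate = allowedErrors > 0
            ? actualErrors / allowedErrors
            : (actualErrors > 0 ? double.PositiveInfinity : 0.0);

        // Clamp so a window that has already ended reports zero, not a negative span.
        var timeRemaining = window.End - DateTime.UtcNow;

        return new ErrorBudgetStatus
        {
            SLO = sloName,
            ErrorBudgetRemaining = remaining,
            ErrorBudgetUsed = used,
            BurnRate = burnRate,
            TimeRemaining = timeRemaining > TimeSpan.Zero ? timeRemaining : TimeSpan.Zero
        };
    }
}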
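
A worked example may make the budget arithmetic easier to follow. This sketch assumes a request-based SLO and uses one common definition of burn rate: the observed error rate divided by the error rate the SLO allows. `ErrorBudgetMath` and its members are hypothetical.

```csharp
using System;

public static class ErrorBudgetMath
{
    // Number of failed requests the SLO tolerates for the given traffic.
    public static double AllowedErrors(long totalRequests, double sloTarget) =>
        totalRequests * (1 - sloTarget);

    // Share of the budget still unspent; negative once the budget is overspent.
    public static double RemainingFraction(long totalRequests, long errors, double sloTarget)
    {
        double allowed = AllowedErrors(totalRequests, sloTarget);
        if (allowed <= 0)
            throw new ArgumentException("The SLO leaves no error budget for this traffic.");

        return 1 - errors / allowed;
    }

    // Observed error rate divided by the allowed error rate.
    // 1.0 means errors arrive exactly as fast as the SLO permits.
    public static double BurnRate(long totalRequests, long errors, double sloTarget)
    {
        if (sloTarget >= 1)
            throw new ArgumentOutOfRangeException(nameof(sloTarget), "Target must be below 100%.");
        if (totalRequests <= 0)
            return 0; // no traffic, nothing burned

        double observedErrorRate = (double)errors / totalRequests;
        return observedErrorRate / (1 - sloTarget);
    }
}
```

With 2,000,000 requests in the window and a 99.9% SLO, the budget is about 2,000 failed requests. After 500 failures, roughly 75% of the budget remains. Looking at a shorter slice, 30 failures in 10,000 requests corresponds to a burn rate of about 3, meaning the budget would run out three times faster than planned if that pace continued.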

Alerting Strategy

Alert Configuration

{
  "alertConfig": {
    "burnRateAlert": {
      "description": "Alert when burning error budget too fast",
      "condition": "burn_rate > 2 for 1 hour",
      "severity": "Warning",
      "action": "Investigate - might need to slow feature work"
    },
    "budgetExhaustionAlert": {
      "description": "Alert when error budget nearly exhausted",
      "condition": "remaining_budget < 10%",
      "severity": "Critical",
      "action": "Freeze non-essential changes, focus on reliability"
    },
    "sloViolationAlert": {
      "description": "Alert when SLO already violated",
      "condition": "current_period_error_rate > allowed",
      "severity": "Critical",
      "action": "Page on-call, prioritize fixing"
    }
  }
}
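
A condition such as `burn_rate > 2 for 1 hour` has two parts: the threshold, and the requirement that it holds for the whole period. The sketch below shows one possible way to evaluate both, along with how long a budget lasts at a constant burn rate. It assumes burn-rate samples are already being collected at regular intervals; `BurnRateRules` and its members are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BurnRateRules
{
    // At a constant burn rate, the unspent share of a budget sized for
    // `window` lasts window * remainingFraction / burnRate.
    public static TimeSpan TimeToExhaustion(
        TimeSpan window, double burnRate, double remainingFraction = 1.0)
    {
        if (burnRate <= 0) return TimeSpan.MaxValue;      // budget is not being consumed
        if (remainingFraction <= 0) return TimeSpan.Zero; // budget already spent

        return TimeSpan.FromSeconds(window.TotalSeconds * remainingFraction / burnRate);
    }

    // True when the trailing period contains at least one sample and every
    // sample in it is above the threshold.
    public static bool SustainedAbove(
        IEnumerable<(DateTime At, double BurnRate)> samples,
        double threshold, TimeSpan period, DateTime now)
    {
        var recent = samples
            .Where(s => s.At > now - period && s.At <= now)
            .ToList();

        return recent.Count > 0 && recent.All(s => s.BurnRate > threshold);
    }
}
```

At a burn rate of 2, a full 30-day budget is gone in 15 days; with only 10% of the budget left it lasts about 1.5 days, which is one reason to treat the two alerts in the configuration with different urgency.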

Alert Implementation

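using System.Threading.Tasks;

// SLI, Alert, AlertSeverity, GetRecentStatusAsync and SendAlertAsync are
// application-specific types and helpers that are not shown here.
public class SLOAlertService
{
    public async Task EvaluateAndAlertAsync(SLI sli)
    {
        var recentStatus = await GetRecentStatusAsync(sli);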

        // Check burn rate
        if (recentStatus.BurnRate > 2.0)
        {
            await SendAlertAsync(new Alert
            {
                Severity = AlertSeverity.Warning,
                Title = $"High burn rate for {sli.Name}",
                Description = $"Burn rate: {recentStatus.BurnRate:F1}x " +
                    $"({recentStatus.ErrorsBurned} errors burned in " +
                    $"{recentStatus.WindowDuration})",
                Action = "Investigate recent deployments and errors"
            });
        }

        // Check budget exhaustion
        if (recentStatus.BudgetRemainingPercent < 10)
        {
            await SendAlertAsync(new Alert
            {
                Severity = AlertSeverity.Critical,
                Title = $"Error budget nearly exhausted for {sli.Name}",
                Description = $"{recentStatus.BudgetRemainingPercent:F1}% " +
                    $"remaining",
                Action = "Consider freeze on feature work"
            });
        }

        // Check immediate violation
        if (recentStatus.CurrentViolation)
        {
            await SendAlertAsync(new Alert
            {
                Severity = AlertSeverity.Critical,
                Title = $"SLO violation for {sli.Name}",
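                // Compare like with like: the observed error rate against the
                // error rate the SLO allows (1 - target), not the target itself.
                Description = $"Error rate {recentStatus.CurrentErrorRate:P2} " +
                    $"exceeds the {1 - sli.Target:P2} allowed by the SLO",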
                Action = "Page on-call immediately"
            });
        }
    }
}

SLA Definitions

Customer-Facing SLAs

{
  "integrationSLAs": {
    "apiIntegration": {
      "name": "API Integration Service",
      "availability": "99.5%",
      "credits": {
        "99.0-99.5": "10% monthly credit",
        "95-99.0": "25% monthly credit",
        "90-95": "50% monthly credit",
        "< 90": "100% monthly credit"
      },
      "exclusions": [
        "Scheduled maintenance (announced 48h in advance)",
        "Force majeure events",
        "Customer-caused issues"
      ]
    },
    "messageProcessing": {
      "name": "Message Processing Service",
      "processingTime": "99% within 5 minutes",
      "delivery": "99.9% successful delivery",
      "exclusions": [
        "Upstream system outages",
        "Message format errors",
        "Throttling due to customer action"
      ]
    },
    "dataSync": {
      "name": "Data Synchronization",
      "syncLatency": "99% within 15 minutes",
      "syncReliability": "99.5% successful syncs",
      "exclusions": [
        "Source system unavailability",
        "Data volume exceeds agreed limits",
        "Network issues beyond our control"
      ]
    }
  }
}
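
The credit schedule for the API Integration Service can be read as a simple lookup from measured monthly availability to a credit percentage. The sketch below is one possible reading: the tier boundaries in the configuration overlap at 99.0 and 95, so it assumes each tier includes its lower bound and excludes its upper bound. `SlaCredits.CreditPercent` is a hypothetical name.

```csharp
using System;

public static class SlaCredits
{
    // Monthly credit (in percent of the bill) owed for a measured availability,
    // given in percent (for example 99.2). Assumption: a tier includes its
    // lower bound, so exactly 99.0 falls in the 10% tier.
    public static int CreditPercent(double availabilityPercent)
    {
        if (availabilityPercent < 0 || availabilityPercent > 100)
            throw new ArgumentOutOfRangeException(nameof(availabilityPercent));

        if (availabilityPercent >= 99.5) return 0;   // SLA met
        if (availabilityPercent >= 99.0) return 10;
        if (availabilityPercent >= 95.0) return 25;
        if (availabilityPercent >= 90.0) return 50;
        return 100;
    }
}
```

A month measured at 99.2% would yield a 10% credit under this reading, and a month at 97% a 25% credit. Whichever boundary rule you choose, stating it explicitly in the agreement avoids disputes at the tier edges.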

Measuring and Reporting

SLO Dashboard

┌─────────────────────────────────────────────────────────────────────┐
│                    SLO DASHBOARD EXAMPLE                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ORDER PROCESSING API                                              │
│   ─────────────────────                                             │
│                                                                     │
│   Success Rate SLO: 99.9% (30-day window)                           │
│   ════════════════════════════════════                              │
│   Current:    99.85% ████████████████████░░░░░░                     │
│   Budget:     87% remaining                                         │
│   Burn Rate:  1.2x (healthy)                                        │
│                                                                     │
│   Latency SLO: p99 < 500ms (30-day window)                          │
│   ═════════════════════════════════════                             │
│   Current:    420ms █████████████████████░░░░░░                     │
│   Budget:     92% remaining                                         │
│   Burn Rate:  0.9x (healthy)                                        │
│                                                                     │
│   ───────────────────────────────────────────────────               │
│                                                                     │
│   MESSAGE PROCESSOR                                                 │
│   ─────────────────                                                 │
│                                                                     │
│   Processing SLO: 99.95% (7-day window)                             │
│   Current:    99.91% ⚠️ (budget at 60%)                             │
│   Burn Rate:  1.8x ⚠️                                               │
│   Action:     Investigating recent spike in DLQ                     │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Automated Status Page

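using System.Linq;
using System.Threading.Tasks;

// ServiceStatus, ComponentStatus, Incident, StatusLevel and
// GetAllSLOStatusAsync are application-specific and not shown here.
public class StatusPageService
{
    public async Task<ServiceStatus> GetOverallStatusAsync()
    {
        var slos = await GetAllSLOStatusAsync();

        // Any SLO outside its target marks the whole service as degraded.
        var overallStatus = slos.All(s => s.IsHealthy) 
            ? StatusLevel.Operational 
            : StatusLevel.Degraded;

        // Only the first unhealthy SLO is surfaced as the active incident.
        var currentIncident = slos.FirstOrDefault(s => !s.IsHealthy);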

        return new ServiceStatus
        {
            Status = overallStatus,
            Components = slos.Select(s => new ComponentStatus
            {
                Name = s.ServiceName,
                Status = s.IsHealthy ? "operational" : "degraded",
                SLOStatus = s.CurrentValue
            }).ToList(),
            ActiveIncident = currentIncident != null 
                ? new Incident 
                {
                    Title = $"SLO breach: {currentIncident.SLOName}",
                    Description = $"Current: {currentIncident.CurrentValue:P2}, " +
                        $"Target: {currentIncident.Target:P2}",
                    Status = "investigating"
                } 
                : null
        };
    }
}

Best Practices

Implementation Checklist

Practice                   Description
-------------------------  ------------------------------------------------
Measure what matters       Focus on user-facing metrics
SLO stricter than SLA      Leave a buffer before SLA credits are triggered
Use appropriate windows    Match the window to user expectations
Track burn rate            Know how fast you're consuming budget
Alert on trends            Don't wait for a breach to alert
Review regularly           Adjust SLOs based on actual behavior

Common Mistakes

┌─────────────────────────────────────────────────────────────────────┐
│                  SLO MISTAKES TO AVOID                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ✗ Setting SLO too tight (always breaking)                         │
│   ✗ Not having error budget (focus only on SLO success)             │
│   ✗ Alerting on every deviation (alert fatigue)                     │
│   ✗ Using latency averages (use percentiles)                        │
│   ✗ Not excluding planned maintenance                               │
│   ✗ SLO = 100% (impossible, causes burnout)                         │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
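
The warning about latency averages is easiest to see with numbers. The sketch below compares the mean with a nearest-rank percentile; the percentile method is one of several in common use and is chosen here for simplicity, and `LatencyStats` is a hypothetical name.

```csharp
using System;
using System.Linq;

public static class LatencyStats
{
    public static double Mean(double[] samples)
    {
        if (samples.Length == 0)
            throw new ArgumentException("No samples.", nameof(samples));

        return samples.Average();
    }

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    public static double Percentile(double[] samples, double p)
    {
        if (samples.Length == 0)
            throw new ArgumentException("No samples.", nameof(samples));
        if (p <= 0 || p > 100)
            throw new ArgumentOutOfRangeException(nameof(p));

        var sorted = samples.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(p * sorted.Length / 100.0);
        return sorted[rank - 1];
    }
}
```

Take 100 requests where 98 complete in 100 ms and 2 take 5,000 ms. The mean is 198 ms, which looks comfortable against a 500 ms target, yet the p99 is 5,000 ms: the slowest requests are ten times over the target, and the average hides them.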