Chaos Engineering on Azure

Building Resilience Through Controlled Experimentation


Introduction

Chaos Engineering is the discipline of experimenting on a system to discover weaknesses before they cause outages in production. Rather than waiting for failures to happen, teams proactively inject failures to verify that their systems can handle them. On Azure, this means testing everything from single-instance failures to entire region outages, ensuring your integration workloads are truly resilient.

This comprehensive guide covers:

  • Chaos engineering fundamentals — Principles and methodology
  • Azure experiment categories — What can be tested
  • Implementation approaches — Tools and frameworks
  • Experiment design — Safe, controlled testing
  • Integration with CI/CD — Automating chaos

Fundamentals of Chaos Engineering

Core Principles

┌─────────────────────────────────────────────────────────────────────┐
│                  CHAOS ENGINEERING PRINCIPLES                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   1. START WITH A HYPOTHESIS                                        │
│   ─────────────────────────────                                     │
│   "If the primary Service Bus fails, the system will                │
│    automatically route messages to the secondary region             │
│    within 60 seconds without message loss."                         │
│                                                                     │
│   2. VARY REAL-WORLD CONDITIONS                                     │
│   ───────────────────────────────                                   │
│   Test the things that actually break in production:                │
│   - Network latency and timeouts                                    │
│   - Resource exhaustion (CPU, memory, connections)                  │
│   - Service failures and exceptions                                 │
│   - Infrastructure-level issues                                     │
│                                                                     │
│   3. RUN EXPERIMENTS IN PRODUCTION                                  │
│   ─────────────────────────────────                                 │
│   Only production reveals true behavior.                            │
│   Use canary deployments and gradual rollouts.                      │
│                                                                     │
│   4. AUTOMATE EXPERIMENTS TO RUN CONTINUOUSLY                       │
│   ─────────────────────────────────────────────                     │
│   Manual chaos is not scalable.                                     │
│   Integrate into CI/CD pipelines.                                   │
│                                                                     │
│   5. MINIMIZE BLAST RADIUS                                          │
│   ──────────────────────────                                        │
│   Never compromise customer experience.                             │
│   Use feature flags, canaries, and rollback plans.                  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
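
A hypothesis is easier to test when it is written down as something checkable. The following is a minimal Python sketch, not part of any Azure SDK: the `Hypothesis` class and its field names are assumptions made for illustration, and the numbers mirror the Service Bus example in principle 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A steady-state claim that an experiment tries to disprove."""
    description: str
    max_recovery_seconds: float  # 60 in the Service Bus example
    max_lost_messages: int       # 0 means "without message loss"

    def holds(self, recovery_seconds: float, lost_messages: int) -> bool:
        """Return True when the observed outcome satisfies the claim."""
        return (recovery_seconds <= self.max_recovery_seconds
                and lost_messages <= self.max_lost_messages)


failover = Hypothesis(
    description="Messages reroute to the secondary region",
    max_recovery_seconds=60,
    max_lost_messages=0,
)

print(failover.holds(recovery_seconds=42.0, lost_messages=0))  # True
print(failover.holds(recovery_seconds=42.0, lost_messages=3))  # False
```

Stating the limits as data makes it harder to reinterpret the hypothesis after the results are in.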

Experiment Workflow

┌─────────────────────────────────────────────────────────────────────┐
│                  CHAOS EXPERIMENT WORKFLOW                          │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐           │
│   │   PLAN      │────▶│   EXECUTE   │────▶│   ANALYZE   │           │
│   │             │     │             │     │             │           │
│   │ Define      │     │ Inject      │     │ Collect     │           │
│   │ hypothesis  │     │ failure     │     │ metrics     │           │
│   │             │     │             │     │             │           │
│   └─────────────┘     └──────┬──────┘     └──────┬──────┘           │
│                              │                   │                  │
│                              ▼                   ▼                  │
│                        ┌─────────────┐     ┌─────────────┐          │
│                        │   MONITOR   │◀────│   LEARN     │          │
│                        │             │     │             │          │
│                        │ Observe     │     │ Document    │          │
│                        │ system      │     │ findings    │          │
│                        │ behavior    │     │             │          │
│                        └─────────────┘     └─────────────┘          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
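
The workflow above can be sketched as a small driver function. This is an illustrative Python outline under stated assumptions: `run_experiment` and its callback parameters are hypothetical names, and the "system" in the usage example is a plain dictionary rather than a real Azure resource.

```python
from typing import Callable, Dict, List


def run_experiment(
    hypothesis: str,
    inject: Callable[[], None],
    restore: Callable[[], None],
    observe: Callable[[], Dict[str, float]],
    verify: Callable[[List[Dict[str, float]]], bool],
    samples: int = 3,
) -> Dict[str, object]:
    """Run one plan/execute/monitor/analyze/learn cycle."""
    observations: List[Dict[str, float]] = []
    inject()                         # EXECUTE: inject the failure
    try:
        for _ in range(samples):     # MONITOR: observe system behavior
            observations.append(observe())
    finally:
        restore()                    # always undo the fault
    passed = verify(observations)    # ANALYZE: compare with the hypothesis
    # LEARN: return a record that can be documented
    return {"hypothesis": hypothesis, "passed": passed,
            "observations": observations}


# Usage with a fake system whose error rate rises slightly under fault
state = {"fault": False}
report = run_experiment(
    hypothesis="Error rate stays below 1% during the fault",
    inject=lambda: state.update(fault=True),
    restore=lambda: state.update(fault=False),
    observe=lambda: {"error_rate": 0.004 if state["fault"] else 0.0},
    verify=lambda obs: all(o["error_rate"] < 0.01 for o in obs),
)
print(report["passed"])  # True
```

Putting `restore()` in a `finally` block means the fault is removed even if monitoring itself raises.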

Azure Chaos Experiments

Experiment Categories

{
  "chaosCategories": {
    "application": [
      "Simulate service unavailability",
      "Inject exception handling delays",
      "Throw unhandled exceptions",
      "Simulate dependency timeout",
      "Corrupt response data"
    ],
    "infrastructure": [
      "Stop Azure Function instance",
      "Scale down App Service plan",
      "Exhaust database connections",
      "Simulate disk space pressure",
      "Network latency injection"
    ],
    "platform": [
      "Region outage simulation",
      "Availability zone failure",
      "Service Bus unavailability",
      "Cosmos DB region failure",
      "Storage account connectivity loss"
    ],
    "security": [
      "Simulate token expiration",
      "Revoke managed identity access",
      "Inject authentication failures",
      "Simulate permission denied errors"
    ]
  }
}

Azure Chaos Studio

# Enable Chaos Studio
az provider register --namespace Microsoft.Chaos

# Create a chaos experiment (via ARM template)
az deployment group create \
  --resource-group my-rg \
  --template-file chaos-experiment.json

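# List the target types Chaos Studio supports in a region
SUBSCRIPTION_ID=$(az account show --query id --output tsv)
az rest --method get \
  --url "https://management.azure.com/subscriptions/${SUBSCRIPTION_ID}/providers/Microsoft.Chaos/locations/eastus/targetTypes?api-version=2023-11-01"

# Run an experiment (start it through the REST API)
az rest --method post \
  --url "https://management.azure.com/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/my-rg/providers/Microsoft.Chaos/experiments/my-experiment/start?api-version=2023-11-01"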

Experiment Definition

{
  "name": "ServiceBusFailureExperiment",
  "location": "eastus",
  "properties": {
    "steps": [
      {
        "name": "Step 1: Induce Service Bus failure",
        "actions": [
          {
            "type": "continuous",
            "name": "disconnect-sb",
            "providerType": "Network",
            "parameters": {
              "subscriptionId": "xxx",
              "resourceGroup": "rg-production",
              "resourceType": "Microsoft.ServiceBus/namespaces",
              "resourceName": "integration-sb",
              "method": "disable"
            },
            "duration": "PT5M"
          }
        ]
      },
      {
        "name": "Step 2: Monitor system behavior",
        "pause": {
          "duration": "PT5M"
        }
      },
      {
        "name": "Step 3: Restore service",
        "actions": [
          {
            "type": "continuous",
            "name": "restore-sb",
            "providerType": "Network",
            "parameters": {
              "subscriptionId": "xxx",
              "resourceGroup": "rg-production",
              "resourceType": "Microsoft.ServiceBus/namespaces",
              "resourceName": "integration-sb",
              "method": "enable"
            },
            "duration": "PT1M"
          }
        ]
      }
    ],
    "selectors": [
      {
        "type": "List",
        "name": "target-resources",
        "id": "target-resources"
      }
    ]
  }
}
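
The `duration` values in the definition are ISO 8601 durations (`PT5M` is five minutes). As a rough aid for planning a time window, the Python sketch below parses the hour/minute/second subset of that format and sums the three steps above. The `duration_seconds` helper is a hypothetical name, and it deliberately rejects day or month components.

```python
import re

# Matches the time-only subset of ISO 8601 durations, e.g. PT5M or PT1H30M
_DURATION = re.compile(
    r"^PT(?:(?P<h>\d+)H)?(?:(?P<m>\d+)M)?(?:(?P<s>\d+)S)?$"
)


def duration_seconds(value: str) -> int:
    """Convert a PT..H..M..S duration string to whole seconds."""
    match = _DURATION.match(value)
    if not match or not any(match.groupdict().values()):
        raise ValueError(f"unsupported duration: {value!r}")
    parts = {key: int(val or 0) for key, val in match.groupdict().items()}
    return parts["h"] * 3600 + parts["m"] * 60 + parts["s"]


# Durations of the three steps in the experiment definition
step_durations = ["PT5M", "PT5M", "PT1M"]
total = sum(duration_seconds(d) for d in step_durations)
print(total)  # 660
```

Eleven minutes of total runtime is the number to compare against the agreed experiment window.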

Implementation Examples

Service Bus Failure Test

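using System.Collections.Concurrent;
using System.Linq;
using Azure.Identity;
using Azure.Messaging.ServiceBus;

public class ChaosServiceBusExperiment
{
    public async Task TestFailoverAsync()
    {
        // Hypothesis: "When primary Service Bus becomes unavailable,
        // application will seamlessly switch to secondary within 60 seconds"

        // A namespace-only constructor requires a credential; a single
        // string argument would be interpreted as a connection string.
        var credential = new DefaultAzureCredential();
        await using var primaryClient = new ServiceBusClient(
            "primary.servicebus.windows.net", credential);
        await using var secondaryClient = new ServiceBusClient(
            "secondary.servicebus.windows.net", credential);

        // Start monitoring; the token stops the loop when the test ends
        using var cts = new CancellationTokenSource();
        var metrics = new ConcurrentQueue<MetricSnapshot>();
        var monitorTask = Task.Run(async () =>
        {
            while (!cts.Token.IsCancellationRequested)
            {
                metrics.Enqueue(await CaptureMetricsAsync());
                await Task.Delay(TimeSpan.FromSeconds(5), cts.Token);
            }
        });

        try
        {
            // Inject failure
            await DisableServiceBusNamespaceAsync("primary");

            // Wait for failover
            await Task.Delay(TimeSpan.FromSeconds(60));

            // Verify
            var finalMetrics = metrics.Last();
            if (finalMetrics.FailedMessages > 0)
            {
                throw new ExperimentFailedException(
                    "Messages were lost during failover");
            }
        }
        finally
        {
            // Cleanup: restore the primary even when verification fails
            await EnableServiceBusNamespaceAsync("primary");

            // Stop the monitor loop and wait for it to exit
            cts.Cancel();
            try { await monitorTask; }
            catch (OperationCanceledException) { }
        }
    }
}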
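
The hypothesis above speaks of failover "within 60 seconds", which is easier to verify from a timeline than from a single final snapshot. The Python sketch below is one possible way to derive a recovery time. The `(elapsed_seconds, healthy)` snapshot shape and the `recovery_time` function are assumptions made for this example, not part of the C# test.

```python
from typing import List, Optional, Tuple

# (seconds since fault injection, was the system healthy at that moment?)
Snapshot = Tuple[float, bool]


def recovery_time(snapshots: List[Snapshot]) -> Optional[float]:
    """Return when the system became healthy and stayed healthy.

    Returns None if the last snapshot is still unhealthy.
    """
    recovered_at: Optional[float] = None
    for elapsed, healthy in snapshots:
        if not healthy:
            recovered_at = None       # a relapse resets the clock
        elif recovered_at is None:
            recovered_at = elapsed    # first healthy sample of this streak
    return recovered_at


timeline = [(0, False), (5, False), (10, True),
            (15, False), (20, True), (25, True)]
elapsed = recovery_time(timeline)
print(elapsed)                                  # 20
print(elapsed is not None and elapsed <= 60)    # True
```

Resetting on a relapse avoids reporting an early, short-lived recovery as the failover time.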

Function App Scale Test

{
  "experiment": {
    "name": "FunctionAppScaleDown",
    "hypothesis": "System will maintain 99% success rate even at 50% capacity",
    "injection": {
      "type": "resource",
      "action": "scale-down",
      "target": "function-app",
      "scale": 0.5,
      "duration": "PT10M"
    },
    "monitors": [
      {
        "name": "latency-monitor",
        "metric": "Latency",
        "threshold": "p99 < 500ms"
      },
      {
        "name": "error-monitor",
        "metric": "ErrorRate",
        "threshold": "< 1%"
      },
      {
        "name": "throughput-monitor",
        "metric": "Throughput",
        "minimum": "1000 req/min"
      }
    ]
  }
}
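
To make the three monitors concrete, the Python sketch below evaluates them against collected samples. It assumes a nearest-rank definition of the 99th percentile and hypothetical function names (`p99`, `evaluate_monitors`); the sample data is invented for illustration only.

```python
from typing import Dict, List


def p99(latencies_ms: List[float]) -> float:
    """Nearest-rank 99th percentile of the given latencies."""
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]


def evaluate_monitors(latencies_ms: List[float], errors: int,
                      requests: int, minutes: float) -> Dict[str, bool]:
    """Apply the thresholds from the experiment definition above."""
    return {
        "latency-monitor": p99(latencies_ms) < 500,          # p99 < 500ms
        "error-monitor": errors / requests < 0.01,            # < 1%
        "throughput-monitor": requests / minutes >= 1000,     # 1000 req/min
    }


# Invented sample: 100 latency readings, 12,000 requests over 10 minutes
latencies = [120.0] * 98 + [450.0, 900.0]
result = evaluate_monitors(latencies, errors=50, requests=12_000, minutes=10)
print(result)
print(all(result.values()))  # True
```

Note that a single 900 ms outlier does not break the p99 threshold here, while the same reading would fail a maximum-latency check.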

Database Connection Exhaustion

public class DbConnectionChaosExperiment
{
    public async Task TestConnectionPoolExhaustionAsync()
    {
        // Hypothesis: "Application will fail gracefully when database
        // connections are exhausted, with clear error messages"

        // Record the baseline connection count for the experiment report
        var initialConnections = await GetDbConnectionCountAsync();

        var connections = new List<SqlConnection>();
        try
        {
            // Open connections until the pool or the server refuses more
            try
            {
                while (true)
                {
                    var conn = new SqlConnection(_connectionString);
                    // Track before opening so a failed open is disposed too
                    connections.Add(conn);
                    await conn.OpenAsync();
                }
            }
            catch (SqlException ex) when (ex.Number == 233)
            {
                // Connection limit reached
            }
            catch (InvalidOperationException)
            {
                // Client-side pool exhausted: the wait for a pooled
                // connection timed out
            }

            // Test application behavior under pressure
            var response = await _httpClient.GetAsync("/api/orders");

            // Verify graceful degradation
            Assert.Equal(HttpStatusCode.ServiceUnavailable, response.StatusCode);

            // Verify clear error message
            var content = await response.Content.ReadAsStringAsync();
            Assert.Contains("database", content, StringComparison.OrdinalIgnoreCase);
        }
        finally
        {
            // Cleanup connections, even when an assertion fails
            foreach (var conn in connections)
            {
                await conn.DisposeAsync();
            }
        }
    }
}
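
The experiment above expects a 503 response that mentions the database. For reference, the Python sketch below shows one way an application could produce that behavior: a bounded pool that reports exhaustion immediately instead of queueing callers. `BoundedPool` and `handle_orders` are hypothetical stand-ins, not the application under test.

```python
import threading
from typing import Tuple


class BoundedPool:
    """Stand-in for a database connection pool with a fixed size."""

    def __init__(self, size: int) -> None:
        self._slots = threading.BoundedSemaphore(size)

    def try_acquire(self) -> bool:
        # Non-blocking: report exhaustion instead of queueing the caller
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


def handle_orders(pool: BoundedPool) -> Tuple[int, str]:
    """Return (status, body) the way an /api/orders handler might."""
    if not pool.try_acquire():
        return 503, "Service unavailable: database connection pool exhausted"
    try:
        return 200, "[]"  # placeholder for the real query result
    finally:
        pool.release()


pool = BoundedPool(size=2)
held = [pool.try_acquire() for _ in range(2)]  # simulate the chaos fault
status, body = handle_orders(pool)
print(status, body)
```

Failing fast keeps request threads from piling up behind a pool that cannot serve them.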

Safe Experimentation

Blast Radius Control

┌─────────────────────────────────────────────────────────────────────┐
│                  BLAST RADIUS MITIGATION                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   STRATEGY              IMPLEMENTATION                              │
│   ───────────────────────────────────────────────────────────────   │
│   Feature Flags         Disable chaos for specific tenants          │
│   Canary Routing        Route subset of traffic to test region      │
│   Time Windows          Run during low-traffic periods              │
│   Gradual Rollout       Start with 1% of traffic, scale up          │
│   Automated Rollback    Stop experiment if metrics degrade          │
│   Circuit Breaker       Stop if downstream services are affected    │
│                                                                     │
│   ───────────────────────────────────────────────────────────────   │
│                                                                     │
│   STOP CONDITIONS (always define these):                            │
│   • Error rate > 5%                                                 │
│   • Latency increase > 100%                                         │
│   • Customer impact detected                                        │
│   • Any 5xx error on critical endpoints                             │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
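
The feature-flag and gradual-rollout rows can be combined in a few lines. The Python sketch below assumes that tenants are identified by a string ID and that a stable hash decides who falls inside the blast radius; `in_blast_radius` and the tenant names are hypothetical.

```python
import hashlib
from typing import FrozenSet


def in_blast_radius(tenant_id: str, percent: float,
                    excluded: FrozenSet[str] = frozenset()) -> bool:
    """Decide whether a tenant is exposed to the experiment.

    percent is the share of tenants to include (1 means 1%).
    excluded lists tenants that must never see chaos.
    """
    if tenant_id in excluded:
        return False
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100


protected = frozenset({"tenant-gold-1"})
tenants = [f"tenant-{i}" for i in range(10_000)]
exposed = sum(in_blast_radius(t, 1, protected) for t in tenants)
print(f"{exposed} of {len(tenants)} tenants exposed at 1%")
print(in_blast_radius("tenant-gold-1", 100, protected))  # False
```

Because the bucket depends only on the tenant ID, raising the percentage adds tenants without reshuffling the ones already exposed.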

Experiment Guardrails

public class ExperimentGuardrails
{
    public async Task<bool> ShouldContinueAsync()
    {
        var currentMetrics = await GetCurrentMetricsAsync();

        // Check stop conditions
        if (currentMetrics.ErrorRate > 0.05m)
        {
            await StopExperimentAsync("Error rate exceeded 5%");
            return false;
        }

        if (currentMetrics.LatencyP99 > _baselineLatency * 2)
        {
            await StopExperimentAsync("Latency doubled");
            return false;
        }

        if (_preventBusinessHours && await IsBusinessHoursAsync())
        {
            await StopExperimentAsync("Business hours - cannot run");
            return false;
        }

        return true;
    }

    public async Task RollbackAsync()
    {
        // Restore all modified configurations
        await RestoreServiceBusConfigurationAsync();
        await RestoreFunctionAppConfigurationAsync();
        await ReleaseDatabaseConnectionsAsync();
    }
}
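
Rollback is most reliable when each restore step is registered at the moment its fault is injected. The Python sketch below illustrates that idea with the standard library's `contextlib.ExitStack`, which runs registered callbacks in reverse order. `run_with_rollback` and the fault names are assumptions for this example, not a port of the class above.

```python
from contextlib import ExitStack
from typing import Callable, List, Tuple

Fault = Tuple[Callable[[], None], Callable[[], None]]  # (inject, rollback)


def run_with_rollback(faults: List[Fault],
                      should_continue: Callable[[], bool],
                      max_checks: int = 10) -> str:
    """Inject faults, poll the guardrail, and always roll back."""
    with ExitStack() as stack:
        for inject, rollback in faults:
            inject()
            stack.callback(rollback)  # registered only after injection
        for _ in range(max_checks):
            if not should_continue():
                return "stopped"      # rollbacks still run on exit
        return "completed"


log: List[str] = []
faults = [
    (lambda: log.append("inject service-bus"),
     lambda: log.append("restore service-bus")),
    (lambda: log.append("inject function-app"),
     lambda: log.append("restore function-app")),
]
checks = iter([True, True, False])  # guardrail trips on the third poll
outcome = run_with_rollback(faults, lambda: next(checks))
print(outcome)  # stopped
print(log)
```

The log ends with the function app restored before the Service Bus, the reverse of the injection order.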

Integration with CI/CD

GitHub Actions Chaos Pipeline

name: Chaos Engineering Pipeline

on:
  schedule:
    - cron: '0 2 * * 1'  # Weekly at 2 AM Monday
  workflow_dispatch:

jobs:
  plan:
    runs-on: ubuntu-latest
    outputs:
      experiment: ${{ steps.select.outputs.experiment }}
    steps:
      - uses: actions/checkout@v3
      - name: Select experiment
        id: select
        run: |
          # Rotate through different experiments
          echo "experiment=servicebus-failover" >> $GITHUB_OUTPUT

  execute:
    needs: plan
    runs-on: ubuntu-latest
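    steps:
      - name: Log in to Azure
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}  # Assumed secret name

      - name: Run chaos experiment
        run: |
          # Start the experiment through the Chaos Studio REST API
          SUBSCRIPTION_ID=$(az account show --query id --output tsv)
          az rest --method post \
            --url "https://management.azure.com/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/rg-chaos/providers/Microsoft.Chaos/experiments/${{ needs.plan.outputs.experiment }}/start?api-version=2023-11-01"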

      - name: Monitor during experiment
        run: |
          # Check Application Insights for anomalies
          az monitor app-insights query \
            --app my-app-insights \
            --analytics-query "requests | where timestamp > ago(10m)"

      - name: Collect results
        run: |
          # Store experiment results
          echo "Experiment completed successfully"

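  report:
    needs: execute
    if: always()  # Run even when the execute job fails
    runs-on: ubuntu-latest
    steps:
      - name: Generate report
        run: |
          # Create issue or PR with results
          echo "Chaos experiment results: ${{ needs.execute.result }}" >> $GITHUB_STEP_SUMMARY

      - name: Notify on failure
        if: needs.execute.result == 'failure'
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}  # Assumed secret name
        run: |
          # Send Slack notification
          curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d '{"text": "Chaos experiment failed!"}'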
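
The "Select experiment" step in the pipeline is left as a placeholder for rotating through experiments. One simple option, sketched in Python below, is to pick by ISO week number so that each weekly run selects the next experiment. Only `servicebus-failover` comes from the pipeline; the other two names and the `experiment_for` function are hypothetical.

```python
import datetime
from typing import List

# Only the first name appears in the pipeline; the others are placeholders
EXPERIMENTS: List[str] = [
    "servicebus-failover",
    "function-scale-down",
    "db-connection-exhaustion",
]


def experiment_for(day: datetime.date) -> str:
    """Pick the experiment for the ISO week that contains the given day."""
    week = day.isocalendar()[1]
    return EXPERIMENTS[week % len(EXPERIMENTS)]


print(experiment_for(datetime.date(2024, 1, 1)))   # function-scale-down
print(experiment_for(datetime.date(2024, 1, 8)))   # db-connection-exhaustion
print(experiment_for(datetime.date(2024, 1, 15)))  # servicebus-failover
```

A script like this could write its result to `$GITHUB_OUTPUT`. The rotation can repeat or skip an entry at a year boundary, since ISO week numbers restart at 1.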

Best Practices

Implementation Checklist

Practice              Description
──────────────────────────────────────────────────────────────────────────────
Start small           Begin with application-level chaos, not infrastructure
Hypothesis-first      Always define what you're testing and why
Monitor everything    Capture metrics before, during, and after
Automated rollback    Always have a way to quickly stop the experiment
Document findings     Build institutional knowledge from each experiment
Regular cadence       Integrate into regular engineering practice

Anti-Patterns

┌─────────────────────────────────────────────────────────────────────┐
│                  CHAOS ANTI-PATTERNS                                │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ✗ Testing in production without monitoring                        │
│   ✗ Running experiments without rollback plan                       │
│   ✗ Testing during peak hours                                       │
│   ✗ Not telling anyone you're running chaos                         │
│   ✗ Blaming teams when experiments reveal issues                    │
│   ✗ Running experiments without hypothesis                          │
│   ✗ Ignoring successful experiments (they're data too!)             │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
