Chaos Engineering on Azure
Building Resilience Through Controlled Experimentation
Introduction
Chaos Engineering is the discipline of experimenting on a system to discover weaknesses before they cause outages in production. Rather than waiting for failures to happen, teams proactively inject failures to verify that their systems can handle them. On Azure, this means testing everything from single-instance failures to entire region outages, ensuring your integration workloads are truly resilient.
This comprehensive guide covers:
- Chaos engineering fundamentals — Principles and methodology
- Azure experiment categories — What can be tested
- Implementation approaches — Tools and frameworks
- Experiment design — Safe, controlled testing
- Integration with CI/CD — Automating chaos
Fundamentals of Chaos Engineering
Core Principles
┌─────────────────────────────────────────────────────────────────────┐
│ CHAOS ENGINEERING PRINCIPLES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ 1. START WITH A HYPOTHESIS │
│ ───────────────────────────── │
│ "If the primary Service Bus fails, the system will │
│ automatically route messages to the secondary region │
│ within 60 seconds without message loss." │
│ │
│ 2. VARY REAL-WORLD CONDITIONS │
│ ─────────────────────────────── │
│ Test the things that actually break in production: │
│ - Network latency and timeouts │
│ - Resource exhaustion (CPU, memory, connections) │
│ - Service failures and exceptions │
│ - Infrastructure-level issues │
│ │
│ 3. RUN EXPERIMENTS IN PRODUCTION │
│ ───────────────────────────────── │
│ Only production reveals true behavior. │
│ Use canary deployments and gradual rollouts. │
│ │
│ 4. AUTOMATE EXPERIMENTS TO RUN CONTINUOUSLY │
│ ───────────────────────────────────────────── │
│ Manual chaos is not scalable. │
│ Integrate into CI/CD pipelines. │
│ │
│ 5. MINIMIZE BLAST RADIUS │
│ ────────────────────────── │
│ Never compromise customer experience. │
│ Use feature flags, canaries, and rollback plans. │
│ │
└─────────────────────────────────────────────────────────────────────┘
Experiment Workflow
┌─────────────────────────────────────────────────────────────────────┐
│ CHAOS EXPERIMENT WORKFLOW │
├─────────────────────────────────────────────────────────────────────┤
│ │
│  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐            │
│  │    PLAN     │────▶│   EXECUTE   │────▶│   ANALYZE   │            │
│  │             │     │             │     │             │            │
│  │ Define      │     │ Inject      │     │ Collect     │            │
│  │ hypothesis  │     │ failure     │     │ metrics     │            │
│  │             │     │             │     │             │            │
│  └─────────────┘     └──────┬──────┘     └──────┬──────┘            │
│                             │                   │                   │
│                             ▼                   ▼                   │
│                      ┌─────────────┐     ┌─────────────┐            │
│                      │   MONITOR   │◀────│    LEARN    │            │
│                      │             │     │             │            │
│                      │ Observe     │     │ Document    │            │
│                      │ system      │     │ findings    │            │
│                      │ behavior    │     │             │            │
│                      └─────────────┘     └─────────────┘            │
│ │
└─────────────────────────────────────────────────────────────────────┘
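The workflow above can be sketched as a small runner. This is a minimal illustration, not a Chaos Studio API: `ExperimentRunner`, `ExperimentResult`, and every delegate parameter are hypothetical names introduced here. The point it tries to show is that the restore step sits in a `finally` block, so the fault is reverted even when monitoring throws or the hypothesis fails early.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical types for illustration only.
public sealed record ExperimentResult(bool HypothesisHeld, IReadOnlyList<double> Observations);

public static class ExperimentRunner
{
    public static async Task<ExperimentResult> RunAsync(
        Func<Task> inject,                   // EXECUTE: start the fault
        Func<Task<double>> observe,          // MONITOR: sample one metric value
        Func<double, bool> withinTolerance,  // ANALYZE: does the sample satisfy the hypothesis?
        Func<Task> restore,                  // always runs, even on failure
        int samples,
        TimeSpan interval,
        CancellationToken ct = default)
    {
        var observations = new List<double>();
        var held = true;
        try
        {
            await inject();
            for (var i = 0; i < samples; i++)
            {
                var value = await observe();
                observations.Add(value);
                if (!withinTolerance(value))
                {
                    held = false; // stop early: the hypothesis is already disproved
                    break;
                }
                await Task.Delay(interval, ct);
            }
        }
        finally
        {
            await restore();
        }
        // LEARN: the caller documents the returned observations.
        return new ExperimentResult(held, observations);
    }
}
```

A caller would pass the fault injection and restore calls as delegates, for example an error-rate probe as `observe` and `v => v <= 0.05` as `withinTolerance`.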
Azure Chaos Experiments
Experiment Categories
{
  "chaosCategories": {
    "application": [
      "Simulate service unavailability",
      "Inject exception handling delays",
      "Throw unhandled exceptions",
      "Simulate dependency timeout",
      "Corrupt response data"
    ],
    "infrastructure": [
      "Stop Azure Function instance",
      "Scale down App Service plan",
      "Exhaust database connections",
      "Simulate disk space pressure",
      "Network latency injection"
    ],
    "platform": [
      "Region outage simulation",
      "Availability zone failure",
      "Service Bus unavailability",
      "Cosmos DB region failure",
      "Storage account connectivity loss"
    ],
    "security": [
      "Simulate token expiration",
      "Revoke managed identity access",
      "Inject authentication failures",
      "Simulate permission denied errors"
    ]
  }
}
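Application-level faults such as "Simulate service unavailability" and "Simulate dependency timeout" can be injected without touching infrastructure. The sketch below is one possible approach, assuming the dependency is called through `HttpClient`: a `DelegatingHandler` that adds latency and returns 503 for a configurable share of requests. `ChaosHandler` and `StaticResponseHandler` are hypothetical names; the latter only stands in for a real dependency so the handler can be exercised without a network.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical handler: delays every request and fails a fraction of them.
public sealed class ChaosHandler : DelegatingHandler
{
    private readonly double _failureRate;
    private readonly TimeSpan _addedLatency;
    private readonly Random _random;

    public ChaosHandler(HttpMessageHandler inner, double failureRate,
                        TimeSpan addedLatency, int? seed = null) : base(inner)
    {
        if (failureRate is < 0 or > 1)
            throw new ArgumentOutOfRangeException(nameof(failureRate));
        _failureRate = failureRate;
        _addedLatency = addedLatency;
        _random = seed is int s ? new Random(s) : new Random();
    }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Latency injection: applies to every request.
        await Task.Delay(_addedLatency, cancellationToken);

        bool fail;
        lock (_random) { fail = _random.NextDouble() < _failureRate; }

        if (fail)
        {
            // Failure injection: the real dependency is never called.
            return new HttpResponseMessage(HttpStatusCode.ServiceUnavailable)
            {
                RequestMessage = request,
                ReasonPhrase = "Injected by ChaosHandler"
            };
        }
        return await base.SendAsync(request, cancellationToken);
    }
}

// Stand-in for a real dependency; always answers with the given status.
public sealed class StaticResponseHandler : HttpMessageHandler
{
    private readonly HttpStatusCode _status;
    public StaticResponseHandler(HttpStatusCode status) => _status = status;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
        => Task.FromResult(new HttpResponseMessage(_status) { RequestMessage = request });
}
```

In an application, the handler would wrap the real transport, for example `new HttpClient(new ChaosHandler(new HttpClientHandler(), 0.1, TimeSpan.FromMilliseconds(200)))`, ideally enabled only behind a feature flag.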
Azure Chaos Studio
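# Register the Chaos Studio resource provider
az provider register --namespace Microsoft.Chaos

# Create a chaos experiment (via ARM template)
az deployment group create \
  --resource-group my-rg \
  --template-file chaos-experiment.json

# The calls below use the Chaos Studio REST API through az rest.
# SUBSCRIPTION_ID is assumed to be set in the shell.

# List the target types (and, through them, the capabilities) available in a region
az rest --method get \
  --url "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/providers/Microsoft.Chaos/locations/eastus/targetTypes?api-version=2023-11-01"

# Start an experiment
az rest --method post \
  --url "https://management.azure.com/subscriptions/$SUBSCRIPTION_ID/resourceGroups/my-rg/providers/Microsoft.Chaos/experiments/my-experiment/start?api-version=2023-11-01"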
Experiment Definition
The definition below is a simplified illustration of an experiment's shape. A real Chaos Studio experiment nests actions under steps and branches, names each fault by its URN, and ties each action to its targets through a selectorId.
{
  "name": "ServiceBusFailureExperiment",
  "location": "eastus",
  "properties": {
    "steps": [
      {
        "name": "Step 1: Induce Service Bus failure",
        "actions": [
          {
            "type": "continuous",
            "name": "disconnect-sb",
            "providerType": "Network",
            "parameters": {
              "subscriptionId": "xxx",
              "resourceGroup": "rg-production",
              "resourceType": "Microsoft.ServiceBus/namespaces",
              "resourceName": "integration-sb",
              "method": "disable"
            },
            "duration": "PT5M"
          }
        ]
      },
      {
        "name": "Step 2: Monitor system behavior",
        "pause": {
          "duration": "PT5M"
        }
      },
      {
        "name": "Step 3: Restore service",
        "actions": [
          {
            "type": "continuous",
            "name": "restore-sb",
            "providerType": "Network",
            "parameters": {
              "subscriptionId": "xxx",
              "resourceGroup": "rg-production",
              "resourceType": "Microsoft.ServiceBus/namespaces",
              "resourceName": "integration-sb",
              "method": "enable"
            },
            "duration": "PT1M"
          }
        ]
      }
    ],
    "selectors": [
      {
        "type": "List",
        "name": "target-resources",
        "id": "target-resources"
      }
    ]
  }
}
Implementation Examples
Service Bus Failure Test
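using System.Collections.Concurrent;
using Azure.Identity;
using Azure.Messaging.ServiceBus;

public class ChaosServiceBusExperiment
{
    public async Task TestFailoverAsync()
    {
        // Hypothesis: "When primary Service Bus becomes unavailable,
        // application will seamlessly switch to secondary within 60 seconds"

        // A fully qualified namespace needs a token credential; the
        // single-string constructor expects a connection string instead.
        var credential = new DefaultAzureCredential();
        await using var primaryClient = new ServiceBusClient(
            "primary.servicebus.windows.net", credential);
        await using var secondaryClient = new ServiceBusClient(
            "secondary.servicebus.windows.net", credential);

        // Start monitoring; the token stops the loop when the experiment ends
        var metrics = new ConcurrentQueue<MetricSnapshot>();
        using var cts = new CancellationTokenSource();
        var monitorTask = Task.Run(async () =>
        {
            try
            {
                while (!cts.Token.IsCancellationRequested)
                {
                    metrics.Enqueue(await CaptureMetricsAsync());
                    await Task.Delay(TimeSpan.FromSeconds(5), cts.Token);
                }
            }
            catch (OperationCanceledException)
            {
                // Expected when the experiment finishes
            }
        });

        try
        {
            // Inject failure
            await DisableServiceBusNamespaceAsync("primary");

            // Wait for failover
            await Task.Delay(TimeSpan.FromSeconds(60));

            // Verify
            if (metrics.IsEmpty)
            {
                throw new ExperimentFailedException(
                    "No metrics were captured during the experiment");
            }
            var finalMetrics = metrics.Last();
            if (finalMetrics.FailedMessages > 0)
            {
                throw new ExperimentFailedException(
                    "Messages were lost during failover");
            }
        }
        finally
        {
            // Cleanup: always restore the primary, even if verification failed
            cts.Cancel();
            await EnableServiceBusNamespaceAsync("primary");
            await monitorTask;
        }
    }
}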
Function App Scale Test
{
  "experiment": {
    "name": "FunctionAppScaleDown",
    "hypothesis": "System will maintain 99% success rate even at 50% capacity",
    "injection": {
      "type": "resource",
      "action": "scale-down",
      "target": "function-app",
      "scale": 0.5,
      "duration": "PT10M"
    },
    "monitors": [
      {
        "name": "latency-monitor",
        "metric": "Latency",
        "threshold": "p99 < 500ms"
      },
      {
        "name": "error-monitor",
        "metric": "ErrorRate",
        "threshold": "< 1%"
      },
      {
        "name": "throughput-monitor",
        "metric": "Throughput",
        "minimum": "1000 req/min"
      }
    ]
  }
}
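The three monitors above can be evaluated against raw request samples. The following is a rough sketch under stated assumptions: `RequestSample` and `ScaleDownMonitors` are hypothetical names, the thresholds are hard-coded from the definition above rather than parsed from it, and p99 uses the nearest-rank method (other percentile definitions give slightly different values).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample type: one entry per request observed during the experiment.
public sealed record RequestSample(double LatencyMs, bool Succeeded);

public static class ScaleDownMonitors
{
    // Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
    public static double Percentile(IReadOnlyList<double> values, double p)
    {
        if (values.Count == 0) throw new ArgumentException("No samples", nameof(values));
        var sorted = values.OrderBy(v => v).ToArray();
        var rank = (int)Math.Ceiling(p * sorted.Length / 100.0);
        return sorted[Math.Max(rank, 1) - 1];
    }

    // Returns the names of the monitors whose thresholds were violated.
    public static IReadOnlyList<string> Violations(
        IReadOnlyList<RequestSample> samples, double windowMinutes)
    {
        var violations = new List<string>();
        if (samples.Count == 0)
        {
            // No traffic at all cannot meet the throughput minimum.
            violations.Add("throughput-monitor");
            return violations;
        }

        var p99 = Percentile(samples.Select(s => s.LatencyMs).ToArray(), 99);
        var errorRate = samples.Count(s => !s.Succeeded) / (double)samples.Count;
        var throughput = samples.Count / windowMinutes;

        if (p99 >= 500) violations.Add("latency-monitor");         // p99 < 500ms
        if (errorRate >= 0.01) violations.Add("error-monitor");    // < 1%
        if (throughput < 1000) violations.Add("throughput-monitor"); // >= 1000 req/min
        return violations;
    }
}
```

An empty result means the hypothesis held for that window; any returned name points at the monitor to investigate.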
Database Connection Exhaustion
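using System.Net;
using Microsoft.Data.SqlClient;
using Xunit;

public class DbConnectionChaosExperiment
{
    // Safety cap so the loop cannot run unbounded if no limit is ever hit
    private const int MaxConnectionsToOpen = 10_000;

    private readonly string _connectionString;
    private readonly HttpClient _httpClient;

    public DbConnectionChaosExperiment(string connectionString, HttpClient httpClient)
    {
        _connectionString = connectionString;
        _httpClient = httpClient;
    }

    public async Task TestConnectionPoolExhaustionAsync()
    {
        // Hypothesis: "Application will fail gracefully when database
        // connections are exhausted, with clear error messages"
        var connections = new List<SqlConnection>();
        try
        {
            // Open connections until a limit is reached
            try
            {
                while (connections.Count < MaxConnectionsToOpen)
                {
                    var conn = new SqlConnection(_connectionString);
                    connections.Add(conn); // track first so a failed open is still disposed
                    await conn.OpenAsync();
                }
            }
            catch (SqlException ex) when (ex.Number is 233 or 10928)
            {
                // Server-side connection limit reached
                // (10928 is the Azure SQL resource-limit error)
            }
            catch (InvalidOperationException)
            {
                // Client-side pool exhausted: timed out waiting for a pooled connection
            }

            // Test application behavior under pressure
            var response = await _httpClient.GetAsync("/api/orders");

            // Verify graceful degradation
            Assert.Equal(HttpStatusCode.ServiceUnavailable, response.StatusCode);

            // Verify clear error message
            var content = await response.Content.ReadAsStringAsync();
            Assert.Contains("database", content, StringComparison.OrdinalIgnoreCase);
        }
        finally
        {
            // Cleanup connections, whether or not the assertions passed
            foreach (var conn in connections)
            {
                await conn.DisposeAsync();
            }
        }
    }
}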
Safe Experimentation
Blast Radius Control
┌─────────────────────────────────────────────────────────────────────┐
│ BLAST RADIUS MITIGATION │
├─────────────────────────────────────────────────────────────────────┤
│ │
│  STRATEGY             IMPLEMENTATION                                │
│  ─────────────────────────────────────────────────────────────────  │
│  Feature Flags        Disable chaos for specific tenants            │
│  Canary Routing       Route subset of traffic to test region        │
│  Time Windows         Run during low-traffic periods                │
│  Gradual Rollout      Start with 1% of traffic, scale up            │
│  Automated Rollback   Stop experiment if metrics degrade            │
│  Circuit Breaker      Stop if downstream services affected          │
│ │
│ ───────────────────────────────────────────────────────────────── │
│ │
│ STOP CONDITIONS (always define these): │
│ • Error rate > 5% │
│ • Latency increase > 100% │
│ • Customer impact detected │
│ • Any 5xx error on critical endpoints │
│ │
└─────────────────────────────────────────────────────────────────────┘
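The feature-flag and gradual-rollout strategies both need a stable way to decide which tenants fall inside the experiment. One possible sketch, with the hypothetical names `BlastRadius` and `IsInExperiment`, hashes the tenant ID into one of 100 buckets. It uses SHA-256 rather than `string.GetHashCode`, because the latter is randomized per process on modern .NET and would move tenants in and out of the experiment between restarts.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: deterministic percentage-based targeting.
public static class BlastRadius
{
    public static bool IsInExperiment(string tenantId, string experimentName, int percent)
    {
        if (percent is < 0 or > 100)
            throw new ArgumentOutOfRangeException(nameof(percent));

        // Salting with the experiment name gives each experiment a different tenant subset.
        var hash = SHA256.HashData(Encoding.UTF8.GetBytes($"{experimentName}:{tenantId}"));

        // Map the first four bytes to a bucket in the range 0..99.
        var bucket = BitConverter.ToUInt32(hash, 0) % 100;
        return bucket < percent;
    }
}
```

Because a tenant's bucket never changes, raising `percent` from 1 to 10 only adds tenants; nobody who was already included drops out mid-rollout.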
Experiment Guardrails
public class ExperimentGuardrails
{
    public async Task<bool> ShouldContinueAsync()
    {
        var currentMetrics = await GetCurrentMetricsAsync();

        // Check stop conditions
        if (currentMetrics.ErrorRate > 0.05m)
        {
            await StopExperimentAsync("Error rate exceeded 5%");
            return false;
        }
        if (currentMetrics.LatencyP99 > _baselineLatency * 2)
        {
            await StopExperimentAsync("Latency doubled");
            return false;
        }
        // Check the cheap flag first so the async lookup only runs when it matters
        if (_preventBusinessHours && await IsBusinessHoursAsync())
        {
            await StopExperimentAsync("Business hours - cannot run");
            return false;
        }
        return true;
    }

    public async Task RollbackAsync()
    {
        // Restore all modified configurations
        await RestoreServiceBusConfigurationAsync();
        await RestoreFunctionAppConfigurationAsync();
        await ReleaseDatabaseConnectionsAsync();
    }
}
Integration with CI/CD
GitHub Actions Chaos Pipeline
name: Chaos Engineering Pipeline
on:
  schedule:
    - cron: '0 2 * * 1'  # Weekly, Mondays at 02:00 UTC
  workflow_dispatch:
jobs:
  plan:
    runs-on: ubuntu-latest
    outputs:
      experiment: ${{ steps.select.outputs.experiment }}
    steps:
      - uses: actions/checkout@v4
      - name: Select experiment
        id: select
        run: |
          # Rotate through different experiments
          echo "experiment=servicebus-failover" >> "$GITHUB_OUTPUT"
  execute:
    needs: plan
    runs-on: ubuntu-latest
    steps:
      - name: Log in to Azure
        uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}  # assumed secret name
      - name: Run chaos experiment
        run: |
          # Start the experiment through the Chaos Studio REST API
          az rest --method post \
            --url "https://management.azure.com/subscriptions/${{ secrets.AZURE_SUBSCRIPTION_ID }}/resourceGroups/rg-chaos/providers/Microsoft.Chaos/experiments/${{ needs.plan.outputs.experiment }}/start?api-version=2023-11-01"
      - name: Monitor during experiment
        run: |
          # Check Application Insights for anomalies
          az monitor app-insights query \
            --app my-app-insights \
            --analytics-query "requests | where timestamp > ago(10m)"
      - name: Collect results
        run: |
          # Store experiment results
          echo "Experiment completed successfully"
  report:
    needs: execute
    if: always()  # run even when the execute job fails
    runs-on: ubuntu-latest
    steps:
      - name: Generate report
        run: |
          # Create issue or PR with results
          echo "Chaos experiment results" >> "$GITHUB_STEP_SUMMARY"
      - name: Notify on failure
        if: needs.execute.result == 'failure'
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}  # assumed secret name
        run: |
          # Send Slack notification
          curl -X POST "$SLACK_WEBHOOK" \
            -H 'Content-Type: application/json' \
            -d '{"text": "Chaos experiment failed!"}'
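The "Select experiment" step hard-codes a single experiment even though its comment mentions rotation. One way to rotate, sketched here as a small helper the step could invoke, is to index a fixed list by ISO week number so each weekly run picks the next entry. `ExperimentRotation` is a hypothetical name, and only `servicebus-failover` comes from the pipeline above; any other experiment names passed in are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper: picks one experiment per ISO week.
public static class ExperimentRotation
{
    public static string Select(IReadOnlyList<string> experiments, DateTime utcNow)
    {
        if (experiments.Count == 0)
            throw new ArgumentException("At least one experiment is required", nameof(experiments));

        // ISO 8601 week number (1..53); consecutive weekly runs advance by one slot.
        var week = ISOWeek.GetWeekOfYear(utcNow);
        return experiments[week % experiments.Count];
    }
}
```

Week numbers restart each January, so the rotation may repeat or skip an entry at the year boundary. For a weekly cadence that is usually acceptable, but a persisted counter would be more even.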
Best Practices
Implementation Checklist
| Practice | Description |
|---|---|
| Start small | Begin with application-level chaos, not infrastructure |
| Hypothesis-first | Always define what you're testing and why |
| Monitor everything | Capture metrics before, during, and after |
| Automated rollback | Always have a way to quickly stop the experiment |
| Document findings | Build institutional knowledge from each experiment |
| Regular cadence | Integrate into regular engineering practice |
Anti-Patterns
┌─────────────────────────────────────────────────────────────────────┐
│ CHAOS ANTI-PATTERNS │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ✗ Testing in production without monitoring │
│ ✗ Running experiments without rollback plan │
│ ✗ Testing during peak hours │
│ ✗ Not telling anyone you're running chaos │
│ ✗ Blaming teams when experiments reveal issues │
│ ✗ Running experiments without hypothesis │
│ ✗ Ignoring successful experiments (they're data too!) │
│ │
└─────────────────────────────────────────────────────────────────────┘
Related Topics
- RTO & RPO Design — Recovery objectives
- Availability Zones — Zone resilience
- Distributed Tracing — Observability