Availability Zones — Design for Zone Failure
Building Zone-Resilient Azure Architectures
Introduction
Availability Zones are physically separate locations within an Azure region. Each zone consists of one or more datacenters with independent power, cooling, and networking. Designing for zone failure keeps your applications available even when an entire zone goes offline, which makes zone-aware design a core part of achieving high availability in Azure.
This guide covers:
- Availability Zone concepts — Understanding zone architecture
- Zone vs Region — When to use each
- Azure services by zone support — What's available where
- Implementation patterns — Configuring zone-redundant resources
- Zone-aware coding — Designing for zone failures
- Cost considerations — Balancing availability and budget
Understanding Availability Zones
How Availability Zones Work
┌────────────────────────────────────────────────────────────────────────┐
│ AVAILABILITY ZONES ARCHITECTURE │
├────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ AZURE REGION │ │
│ │ │ │
│ │ ┌─────────────────┐ ┌─────────────────┐ ┌───────────────┐ │ │
│ │ │ ZONE 1 │ │ ZONE 2 │ │ ZONE 3 │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ │
│ │ │ │ Compute │ │ │ │ Compute │ │ │ │ Compute │ │ │ │
│ │ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │ │
│ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ │
│ │ │ │ Network │ │ │ │ Network │ │ │ │ Network │ │ │ │
│ │ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │ │
│ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │ │
│ │ │ │ Storage │ │ │ │ Storage │ │ │ │ Storage │ │ │ │
│ │ │ └──────────┘ │ │ └──────────┘ │ │ └──────────┘ │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ └─────────────────┘ └─────────────────┘ └───────────────┘ │ │
│ │ │ │ │ │ │
│ │ └─────────────────────┼──────────────────┘ │ │
│ │ │ │ │
│ │ Low-latency interconnect │ │
│ │ │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │
│ Zone Failure Impact: │
│ - Zone 1 fails → Zone 2 & 3 remain operational │
│ - Applications span zones → service continues on healthy zones │
│ │
└────────────────────────────────────────────────────────────────────────┘
Region Support for Availability Zones
┌─────────────────────────────────────────────────────────────────────┐
│ REGIONS WITH AVAILABILITY ZONES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Americas: │
│ ────────── │
│ ✓ East US, East US 2, West US 2, West US 3 │
│ ✓ Central US, North Central US, South Central US │
│ ✓ Canada Central │
│ │
│ Europe: │
│ ──────── │
│ ✓ West Europe, North Europe │
│ ✓ UK South, France Central │
│ ✓ Germany West Central, Norway East │
│ │
│ Asia Pacific: │
│ ───────────── │
│ ✓ Southeast Asia, East Asia │
│ ✓ Japan East, Japan West │
│ ✓ Australia East, Central India │
│ │
│ Note: Not all regions support Availability Zones, and the list changes over time. │
│ Always check: az vm list-skus --location <region> --zone --output table │
│ │
└─────────────────────────────────────────────────────────────────────┘
Azure Services and Zone Support
Service Categories
┌─────────────────────────────────────────────────────────────────────┐
│ AZURE SERVICES ZONE SUPPORT │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ZONE-REDUNDANT (Automatic Zone Distribution) │
│ ──────────────────────────────────────────── │
│ ✓ Azure VMs (Zone-redundant scale sets) │
│ ✓ Managed Disks │
│ ✓ Public IP Addresses │
│ ✓ Azure Kubernetes Service (zone-redundant node pools) │
│ ✓ Azure SQL (zone-redundant HA) │
│ ✓ Azure Service Bus (Premium namespace) │
│ │
│ ZONE-REDUNDANT BY TIER (Enable via Plan or SKU) │
│ ────────────────────────────────────────── │
│ ✓ App Service Plans (Premium v2/v3, ASE v3) │
│ ✓ Azure Functions (Elastic Premium) │
│ ✓ Key Vault (Standard/Premium) │
│ ✓ Azure Storage (zone-redundant storage - ZRS) │
│ ✓ Event Hubs (dedicated clusters) │
│ │
│ REGION-PAIRS (Cross-Region for DR) │
│ ───────────────────────────────────── │
│ ✓ Azure Backup │
│ ✓ Azure Site Recovery │
│ ✓ Geo-redundant Storage (GRS) │
│ │
└─────────────────────────────────────────────────────────────────────┘
Compute Services by Zone Support
| Service | Zone Support | Notes |
|---|---|---|
| Virtual Machines | Zonal | Pin each VM to a zone; availability sets cannot be combined with zones |
| VM Scale Sets | Zone-redundant | Automatic distribution across zones |
| App Service | Zone-redundant | Enable zone redundancy on a Premium v2/v3 or Isolated v2 plan |
| Azure Functions | Zone-redundant | Use an Elastic Premium plan with zone redundancy enabled |
| AKS | Zone-redundant | Spread system and user node pools across zones |
| Container Apps | Zone-redundant | Enable zone redundancy on the Container Apps environment |
Implementation Patterns
Zone-Redundant Virtual Machines
# Create a zone-redundant VM scale set (instances spread across zones 1, 2, and 3)
az vmss create \
  --name my-scaleset \
  --resource-group my-rg \
  --location eastus \
  --image Ubuntu2204 \
  --vm-sku Standard_D2s_v3 \
  --instance-count 3 \
  --zones 1 2 3 \
  --upgrade-policy-mode Automatic \
  --lb-sku Standard \
  --generate-ssh-keys

# Create a zonal VM pinned to zone 1
az vm create \
  --name myvm \
  --resource-group my-rg \
  --location eastus \
  --image Ubuntu2204 \
  --size Standard_D2s_v3 \
  --zone 1 \
  --generate-ssh-keys
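To reason about what a three-zone scale set looks like after a zone is lost, the following minimal sketch spreads instances across zones round-robin and counts the survivors. The function names are hypothetical, and the round-robin placement is an illustrative assumption, not a description of how Azure actually places instances.

```python
from typing import Dict, List


def spread_across_zones(instance_count: int, zones: List[str]) -> Dict[str, int]:
    """Assign instances to zones round-robin and return {zone: instance count}."""
    if not zones:
        raise ValueError("at least one zone is required")
    placement = {zone: 0 for zone in zones}
    for i in range(instance_count):
        placement[zones[i % len(zones)]] += 1
    return placement


def surviving_instances(placement: Dict[str, int], failed_zone: str) -> int:
    """Count the instances that remain when one zone is lost."""
    return sum(count for zone, count in placement.items() if zone != failed_zone)


placement = spread_across_zones(3, ["1", "2", "3"])
print(placement)
print(surviving_instances(placement, "1"))
```

With an instance count that is not a multiple of the zone count, one zone carries more instances than the others, so losing that zone costs more capacity.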
Zone-Redundant Storage (ZRS)
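# Create a storage account with zone-redundant storage (ZRS).
# Storage account names must be globally unique; "mystorage" is a placeholder.
az storage account create \
  --name mystorage \
  --resource-group my-rg \
  --location eastus \
  --kind StorageV2 \
  --sku Standard_ZRS

# Create a blob container. Redundancy is set on the account,
# so the container inherits ZRS and needs no redundancy flag.
az storage container create \
  --name mycontainer \
  --account-name mystorage \
  --auth-mode login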
Zone-Redundant Cosmos DB
{
  "type": "Microsoft.DocumentDB/databaseAccounts",
  "apiVersion": "2023-04-15",
  "name": "[parameters('accountName')]",
  "location": "eastus",
  "properties": {
    "databaseAccountOfferType": "Standard",
    "locations": [
      { "locationName": "eastus", "failoverPriority": 0, "isZoneRedundant": true },
      { "locationName": "eastus2", "failoverPriority": 1, "isZoneRedundant": true },
      { "locationName": "westus2", "failoverPriority": 2, "isZoneRedundant": true }
    ]
  }
}
Zone-Aware Application Design
Distribute Across Zones
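using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class ZoneEndpoint
{
    public string Zone { get; set; }
    public string Endpoint { get; set; }
}

public class ZoneAwareLoadBalancer
{
    // Reuse a single HttpClient; creating one per request can exhaust sockets.
    private static readonly HttpClient Client = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };
    private readonly List<ZoneEndpoint> _endpoints;
    private int _currentIndex = -1;

    public ZoneAwareLoadBalancer()
    {
        _endpoints = new List<ZoneEndpoint>
        {
            new ZoneEndpoint { Zone = "1", Endpoint = "https://app-eastus-1.azurewebsites.net" },
            new ZoneEndpoint { Zone = "2", Endpoint = "https://app-eastus-2.azurewebsites.net" },
            new ZoneEndpoint { Zone = "3", Endpoint = "https://app-eastus-3.azurewebsites.net" }
        };
    }

    public async Task<string> CallServiceAsync()
    {
        // Rotate the starting zone (round-robin), then fall back through the others.
        var start = (int)((uint)Interlocked.Increment(ref _currentIndex) % (uint)_endpoints.Count);
        for (var i = 0; i < _endpoints.Count; i++)
        {
            var endpoint = _endpoints[(start + i) % _endpoints.Count];
            try
            {
                return await CallEndpointAsync(endpoint);
            }
            catch (Exception ex) when (ex is HttpRequestException || ex is TaskCanceledException)
            {
                // Network error, non-success status, or timeout: continue to the next zone.
                Console.WriteLine($"Zone {endpoint.Zone} failed: {ex.Message}");
            }
        }
        throw new InvalidOperationException("All zones unavailable");
    }

    private static async Task<string> CallEndpointAsync(ZoneEndpoint endpoint)
    {
        using var response = await Client.GetAsync($"{endpoint.Endpoint}/api/health");
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}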
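The try-each-zone logic can be exercised without real endpoints by injecting the transport. The sketch below is a minimal illustration under that assumption: `call_with_zone_failover` and `fake_fetch` are hypothetical names, and the simulated outage stands in for a real zone failure.

```python
from typing import Callable, Dict, List, Tuple


def call_with_zone_failover(
    endpoints: Dict[str, str],
    fetch: Callable[[str], str],
) -> Tuple[str, str, List[str]]:
    """Try each zone in order; return (zone, body, zones that failed first)."""
    failed: List[str] = []
    for zone, url in endpoints.items():
        try:
            return zone, fetch(url), failed
        except OSError:
            # ConnectionError and TimeoutError are OSError subclasses.
            failed.append(zone)
    raise RuntimeError(f"All zones unavailable: {failed}")


ENDPOINTS = {
    "1": "https://app-eastus-1.azurewebsites.net",
    "2": "https://app-eastus-2.azurewebsites.net",
    "3": "https://app-eastus-3.azurewebsites.net",
}


def fake_fetch(url: str) -> str:
    """Stand-in transport that simulates an outage in zone 1."""
    if "eastus-1" in url:
        raise ConnectionError("zone 1 unreachable")
    return "healthy"


zone, body, failed = call_with_zone_failover(ENDPOINTS, fake_fetch)
print(zone, body, failed)
```

Returning the list of zones that failed before the successful call gives the caller something to log or alert on, instead of hiding the failover.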
Health Check for Zones
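using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public class ZoneHealth
{
    public string Zone { get; set; }
    public bool IsHealthy { get; set; }
}

public class ZoneHealthFunctions
{
    // Shared client with a short timeout so a dead zone cannot stall the probe.
    private static readonly HttpClient Client = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };

    [FunctionName("ZoneHealthCheck")]
    public async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get")] HttpRequest req)
    {
        var zones = new[] { "1", "2", "3" };

        // Probe all zones in parallel rather than one after another.
        var results = await Task.WhenAll(zones.Select(async zone =>
            new ZoneHealth { Zone = zone, IsHealthy = await CheckZoneHealthAsync(zone) }));

        // 200 only when every zone is healthy; otherwise 503.
        var status = results.All(r => r.IsHealthy)
            ? StatusCodes.Status200OK
            : StatusCodes.Status503ServiceUnavailable;
        return new JsonResult(results) { StatusCode = status };
    }

    private static async Task<bool> CheckZoneHealthAsync(string zone)
    {
        try
        {
            var endpoint = $"https://app-eastus-{zone}.azurewebsites.net/api/health";
            using var response = await Client.GetAsync(endpoint);
            return response.IsSuccessStatusCode;
        }
        catch (Exception ex) when (ex is HttpRequestException || ex is TaskCanceledException)
        {
            return false;
        }
    }
}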
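The health check above reports 503 as soon as any single zone is unhealthy. If that endpoint feeds a load balancer probe, one possible alternative is to distinguish a degraded state (enough zones still healthy) from a full outage. The sketch below assumes a hypothetical `min_healthy` threshold of two zones; both the threshold and the state names are illustrative choices, not part of the original design.

```python
from typing import Dict


def summarize_zone_health(results: Dict[str, bool], min_healthy: int = 2) -> dict:
    """Collapse per-zone probe results into one state and HTTP status code."""
    healthy = sorted(zone for zone, ok in results.items() if ok)
    unhealthy = sorted(zone for zone, ok in results.items() if not ok)
    if not unhealthy:
        state = "healthy"
    elif len(healthy) >= min_healthy:
        state = "degraded"  # still serving, but a zone needs attention
    else:
        state = "unhealthy"
    return {
        "state": state,
        "status_code": 503 if state == "unhealthy" else 200,
        "healthy_zones": healthy,
        "unhealthy_zones": unhealthy,
    }


print(summarize_zone_health({"1": False, "2": True, "3": True}))
```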
Designing for Zone Failure
Failure Scenarios
┌─────────────────────────────────────────────────────────────────────┐
│ ZONE FAILURE SCENARIOS │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Scenario 1: Single Zone Failure │
│ ───────────────────────────────── │
│ Zone 1 goes down → ~67% (two of three zones) capacity remains │
│ ✓ Load balancer removes Zone 1 from rotation │
│ ✓ Traffic redistributed to Zone 2 & 3 │
│ ✓ User experience: Minimal impact │
│ ✓ Action: Auto-scale to compensate │
│ │
│ Scenario 2: Zone Isolation │
│ ───────────────────────────── │
│ Network issue in Zone 1 → Zone 1 unreachable │
│ ✓ Health checks detect failure │
│ ✓ Traffic routes to healthy zones │
│ ✓ Database multi-zone: continues working │
│ ✓ User experience: Brief latency increase │
│ │
│ Scenario 3: Cascading Failure │
│ ──────────────────────────── │
│ Zone 1 fails → Traffic to Zone 2 & 3 │
│ Zones 2 & 3 get overloaded → CPU spikes │
│ ✓ Auto-scale triggers │
│ ✓ Queue buildup │
│ ✓ Circuit breaker prevents complete failure │
│ │
└─────────────────────────────────────────────────────────────────────┘
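Scenario 3 is largely a capacity question: the surviving zones must absorb the load of the failed one. A rough sizing sketch follows, assuming load is spread evenly across zones and that `peak_instances` (a hypothetical input) is the number of instances needed to serve peak traffic.

```python
import math


def plan_zone_capacity(peak_instances: int, zones: int = 3, zone_failures_tolerated: int = 1) -> dict:
    """Size each zone so the surviving zones alone can carry peak load."""
    if zones <= zone_failures_tolerated:
        raise ValueError("must tolerate fewer zone failures than there are zones")
    surviving_zones = zones - zone_failures_tolerated
    per_zone = math.ceil(peak_instances / surviving_zones)
    total = per_zone * zones
    return {
        "per_zone": per_zone,
        "total": total,
        "overprovision_pct": round((total - peak_instances) / peak_instances * 100, 1),
        "surviving_capacity_pct": round(surviving_zones / zones * 100, 1),
    }


print(plan_zone_capacity(peak_instances=6))
```

For a peak of six instances across three zones, this works out to three instances per zone (nine in total), so that any two zones can still serve peak traffic without waiting for auto-scale.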
Circuit Breaker Pattern
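using System;
using System.Threading.Tasks;

public class CircuitOpenException : Exception
{
    public CircuitOpenException() : base("Circuit is open; call rejected.") { }
}

public class CircuitBreaker
{
    private readonly object _lock = new object();
    private readonly int _threshold = 5;
    private readonly TimeSpan _openDuration = TimeSpan.FromMinutes(1);
    private int _failureCount = 0;
    private CircuitState _state = CircuitState.Closed;
    private DateTime _lastFailureTime;

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> operation)
    {
        lock (_lock)
        {
            if (_state == CircuitState.Open)
            {
                // Reject calls until the open period has elapsed, then allow a trial call.
                if (DateTime.UtcNow - _lastFailureTime <= _openDuration)
                {
                    throw new CircuitOpenException();
                }
                _state = CircuitState.HalfOpen;
            }
        }

        try
        {
            var result = await operation();
            lock (_lock)
            {
                // Any success closes the circuit and resets the failure count.
                _failureCount = 0;
                _state = CircuitState.Closed;
            }
            return result;
        }
        catch
        {
            lock (_lock)
            {
                _failureCount++;
                _lastFailureTime = DateTime.UtcNow;
                // A failed trial call in HalfOpen reopens the circuit immediately.
                if (_state == CircuitState.HalfOpen || _failureCount >= _threshold)
                {
                    _state = CircuitState.Open;
                }
            }
            throw;
        }
    }
}

public enum CircuitState { Closed, Open, HalfOpen }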
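The closed, open, and half-open transitions are easier to verify when time is injectable. The following single-threaded sketch makes that assumption: the `clock` parameter, the `CircuitOpenError` name, and the rule that a failed half-open trial reopens the circuit are illustrative choices, not taken from a specific library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, threshold: int = 5, reset_after: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.reset_after = reset_after
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, operation: Callable[[], T]) -> T:
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open")
            self.state = "half_open"  # allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "half_open" or self.failures >= self.threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"
        return result


def failing() -> str:
    raise ConnectionError("zone down")


now = [0.0]  # fake clock so the demo does not need to sleep
breaker = CircuitBreaker(threshold=2, reset_after=60.0, clock=lambda: now[0])
for _ in range(2):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
print(breaker.state)                    # open after two failures
now[0] = 61.0                           # advance past the reset window
print(breaker.call(lambda: "ok"), breaker.state)
```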
Cost Considerations
Zone-Redundant Costs
┌─────────────────────────────────────────────────────────────────────┐
│ ZONE REDUNDANCY COST IMPACT │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Virtual Machines: │
│ ──────────────── │
│ Single zone: 1 VM × $100 = $100/month │
│ Zone-redundant: 3 VMs × $100 = $300/month │
│ Increase: 200% │
│ │
│ Storage: │
│ ──────── │
│ LRS (single zone): $0.02/GB │
│ ZRS (3 zones): $0.03/GB │
│ Increase: 50% │
│ │
│ Cosmos DB: │
│ ────────── │
│ Single region: $0.008/RU │
│ Multi-region + zones: $0.012/RU │
│ Increase: 50% │
│ │
│ ROI: │
│ ────── │
│ ✓ Near-zero RTO for zone failures on critical workloads │
│ ✓ No data loss for synchronously replicated services (e.g., ZRS) │
│ ✓ Meets SLA requirements (99.99% for VMs in 2+ zones) │
│ │
└─────────────────────────────────────────────────────────────────────┘
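The percentage increases in the box follow directly from the listed prices. The sketch below recomputes them; the prices are the illustrative figures from the box, not current Azure list prices, and `percent_increase` is a hypothetical helper.

```python
def percent_increase(baseline: float, redundant: float) -> float:
    """Relative cost increase of the redundant option over the baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline cost must be positive")
    return round((redundant - baseline) / baseline * 100, 1)


# Illustrative prices taken from the box above.
scenarios = {
    "Virtual Machines ($/month)": (1 * 100, 3 * 100),
    "Storage ($/GB)": (0.02, 0.03),
    "Cosmos DB ($/RU)": (0.008, 0.012),
}

for name, (single, redundant) in scenarios.items():
    print(f"{name}: +{percent_increase(single, redundant)}%")
```

Substituting current prices for your region and SKU gives a more realistic comparison than the round numbers used here.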
Best Practices
Implementation Checklist
| Practice | Description |
|---|---|
| Use ZRS for storage | Minimal cost increase for high durability |
| Deploy across 3 zones | Maximum availability for single region |
| Implement health checks | Detect zone failures quickly |
| Configure auto-scaling | Handle increased load after zone failure |
| Test failure scenarios | Regularly simulate zone outages |
| Monitor zone health | Create alerts for zone-specific issues |
SLA Expectations
┌─────────────────────────────────────────────────────────────────────┐
│ ZONE CONFIGURATION AND SLA │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Configuration │ SLA │ Annual Downtime │
│ ────────────────────────┼────────────┼────────────── │
│ Single VM │ 99.9% │ 8.76 hours │
│ Availability Set │ 99.95% │ 4.38 hours │
│ 2 Zones (minimum) │ 99.99% │ 52.6 minutes │
│ 3 Zones (recommended) │ 99.99% │ 52.6 minutes (same SLA, more capacity headroom) │
│ Multi-region │ 99.999% (design target) │ 5.26 minutes │
│ │
└─────────────────────────────────────────────────────────────────────┘
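The downtime column follows arithmetically from the availability percentage. A small sketch of that conversion is shown below, assuming a 365-day year; it also shows how availability compounds when a request depends on several services in series. The function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def annual_downtime_minutes(sla_percent: float) -> float:
    """Maximum downtime per year that an availability percentage allows."""
    return (1 - sla_percent / 100) * MINUTES_PER_YEAR


def serial_availability(*sla_percents: float) -> float:
    """Composite availability when every listed dependency must be up."""
    composite = 1.0
    for sla in sla_percents:
        composite *= sla / 100
    return composite * 100


for sla in (99.9, 99.95, 99.99, 99.999):
    minutes = annual_downtime_minutes(sla)
    print(f"{sla}% -> {minutes:.2f} minutes ({minutes / 60:.2f} hours)")

# Two 99.99% services in series are slightly less available than either alone.
print(f"{serial_availability(99.99, 99.99):.4f}%")
```

The composite figure is a reminder that the SLA of a zone-redundant compute tier is only one factor; every dependency in the request path lowers the end-to-end number.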
Related Topics
- Active-Active Multi-Region — Global deployment
- RTO & RPO Design — Recovery objectives
- Geo-DR for Service Bus — Messaging DR