Availability Zones — Design for Zone Failure

Building Zone-Resilient Azure Architectures


Introduction

Availability Zones are physically separate locations within an Azure region. Each zone is made up of one or more datacenters with independent power, cooling, and networking. Designing your architecture to tolerate the loss of a zone keeps your applications available even when every datacenter in that zone is offline, which makes zone-aware design a core part of achieving high availability in Azure.

This guide covers:

  • Availability Zone concepts — Understanding zone architecture
  • Zone vs Region — When to use each
  • Azure services by zone support — What's available where
  • Implementation patterns — Configuring zone-redundant resources
  • Zone-aware coding — Designing for zone failures
  • Cost considerations — Balancing availability and budget

Understanding Availability Zones

How Availability Zones Work

┌────────────────────────────────────────────────────────────────────────┐
│                   AVAILABILITY ZONES ARCHITECTURE                      │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│   ┌────────────────────────────────────────────────────────────────┐   │
│   │                    AZURE REGION                                │   │
│   │                                                                │   │
│   │   ┌─────────────────┐  ┌─────────────────┐  ┌───────────────┐  │   │
│   │   │   ZONE 1        │  │   ZONE 2        │  │   ZONE 3      │  │   │
│   │   │                 │  │                 │  │               │  │   │
│   │   │  ┌──────────┐   │  │  ┌──────────┐   │  │  ┌──────────┐ │  │   │
│   │   │  │  Compute │   │  │  │  Compute │   │  │  │  Compute │ │  │   │
│   │   │  └──────────┘   │  │  └──────────┘   │  │  └──────────┘ │  │   │
│   │   │  ┌──────────┐   │  │  ┌──────────┐   │  │  ┌──────────┐ │  │   │
│   │   │  │  Network │   │  │  │  Network │   │  │  │  Network │ │  │   │
│   │   │  └──────────┘   │  │  └──────────┘   │  │  └──────────┘ │  │   │
│   │   │  ┌──────────┐   │  │  ┌──────────┐   │  │  ┌──────────┐ │  │   │
│   │   │  │ Storage  │   │  │  │ Storage  │   │  │  │ Storage  │ │  │   │
│   │   │  └──────────┘   │  │  └──────────┘   │  │  └──────────┘ │  │   │
│   │   │                 │  │                 │  │               │  │   │
│   │   └─────────────────┘  └─────────────────┘  └───────────────┘  │   │
│   │         │                     │                  │             │   │
│   │         └─────────────────────┼──────────────────┘             │   │
│   │                               │                                │   │
│   │                   Low-latency interconnect                     │   │
│   │                                                                │   │
│   └────────────────────────────────────────────────────────────────┘   │
│                                                                        │
│   Zone Failure Impact:                                                 │
│   - Zone 1 fails → Zone 2 & 3 remain operational                       │
│   - Applications span zones → stay available                           │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘

Region Support for Availability Zones

┌─────────────────────────────────────────────────────────────────────┐
│               REGIONS WITH AVAILABILITY ZONES                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Americas:                                                         │
│   ──────────                                                        │
│   ✓ East US, East US 2, West US 2, West US 3                        │
│   ✓ Central US, North Central US, South Central US                  │
│   ✓ Canada Central                                                  │
│                                                                     │
│   Europe:                                                           │
│   ────────                                                          │
│   ✓ West Europe, North Europe                                       │
│   ✓ UK South, France Central                                        │
│   ✓ Germany West Central, Norway East                               │
│                                                                     │
│   Asia Pacific:                                                     │
│   ─────────────                                                     │
│   ✓ Southeast Asia, East Asia                                       │
│   ✓ Japan East, Japan West                                          │
│   ✓ Australia East, Central India                                   │
│                                                                     │
│   Note: Zone support varies by region and changes over time.        │
│   Always check: az vm list-skus --location eastus --zone -o table   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Azure Services and Zone Support

Service Categories

┌─────────────────────────────────────────────────────────────────────┐
│                  AZURE SERVICES ZONE SUPPORT                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ZONE-REDUNDANT (Automatic Zone Distribution)                      │
│   ────────────────────────────────────────────                      │
│   ✓ Azure VMs (Zone-redundant scale sets)                           │
│   ✓ Managed Disks                                                   │
│   ✓ Public IP Addresses                                             │
│   ✓ Azure Kubernetes Service (zone-redundant)                       │
│   ✓ Azure SQL (zone-redundant HA)                                   │
│   ✓ Azure Service Bus (Premium namespace)                           │
│                                                                     │
│   TIER-DEPENDENT (Zone Redundancy on Specific Tiers/SKUs)           │
│   ───────────────────────────────────────────────────────           │
│   ✓ App Service Plans (ASE)                                         │
│   ✓ Azure Functions (Elastic Premium)                               │
│   ✓ Key Vault (Standard/Premium)                                    │
│   ✓ Azure Storage (zone-redundant storage - ZRS)                    │
│   ✓ Event Hubs (dedicated clusters)                                 │
│                                                                     │
│   REGION-PAIRS (Cross-Region for DR)                                │
│   ─────────────────────────────────────                             │
│   ✓ Azure Backup                                                    │
│   ✓ Azure Site Recovery                                             │
│   ✓ Geo-redundant Storage (GRS)                                     │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Compute Services by Zone Support

Service          │ Zone Support            │ Notes
─────────────────┼─────────────────────────┼──────────────────────────────────────────────────
Virtual Machines │ Zonal                   │ Pin each VM to a zone and spread VMs across zones
VM Scale Sets    │ Zone-redundant          │ Automatic distribution across zones
App Service      │ Zone-redundant (opt-in) │ Enable zone redundancy on a supported plan tier
Azure Functions  │ Zone-redundant (opt-in) │ Use Premium plan with zone support
AKS              │ Zone-redundant          │ Spread node pools across zones
Container Apps   │ Zone-redundant (opt-in) │ Enable zone redundancy on the environment

Implementation Patterns

Zone-Redundant Virtual Machines

# Create a VM scale set with instances spread across zones 1, 2, and 3
az vmss create \
  --name my-scaleset \
  --resource-group my-rg \
  --location eastus \
  --image Ubuntu2204 \
  --vm-sku Standard_D2s_v3 \
  --instance-count 3 \
  --zones 1 2 3 \
  --upgrade-policy-mode Automatic \
  --lb-sku Standard \
  --generate-ssh-keys

# Create a single zonal VM pinned to zone 1
az vm create \
  --name myvm \
  --resource-group my-rg \
  --location eastus \
  --image Ubuntu2204 \
  --size Standard_D2s_v3 \
  --generate-ssh-keys \
  --zone 1
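
After deployment it is worth confirming that the instances really landed in different zones. The sketch below tallies instances per zone. `count_by_zone` is a hypothetical helper (not part of the Azure CLI), and the commented usage assumes the scale set and resource group names used above and that each instance exposes its zone in a `zones` field; check the output shape against your CLI version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads one zone label per line on stdin
# and prints how many instances were seen in each zone.
count_by_zone() {
  sort | uniq -c | awk '{ printf "zone %s: %d\n", $2, $1 }'
}

# Assumed usage (requires an authenticated Azure CLI session):
#   az vmss list-instances --name my-scaleset --resource-group my-rg \
#     --query "[].zones[0]" --output tsv | count_by_zone
```

An uneven tally (for example, all instances in zone 1) suggests the `--zones` setting was not applied as intended.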

Zone-Redundant Storage (ZRS)

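# Create a storage account with zone-redundant storage (ZRS).
# Account names must be globally unique; "mystorage" is a placeholder.
az storage account create \
  --name mystorage \
  --resource-group my-rg \
  --location eastus \
  --kind StorageV2 \
  --sku Standard_ZRS

# Create a blob container. Containers inherit the account's redundancy,
# so no separate ZRS setting is needed here.
az storage container create \
  --name mycontainer \
  --account-name mystorage \
  --auth-mode login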

Zone-Redundant Cosmos DB

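{
  "type": "Microsoft.DocumentDB/databaseAccounts",
  "apiVersion": "2023-04-15",
  "name": "[parameters('cosmosAccountName')]",
  "location": "eastus",
  "kind": "GlobalDocumentDB",
  "properties": {
    "databaseAccountOfferType": "Standard",
    "locations": [
      { "locationName": "eastus", "failoverPriority": 0, "isZoneRedundant": true },
      { "locationName": "eastus2", "failoverPriority": 1, "isZoneRedundant": true },
      { "locationName": "westus2", "failoverPriority": 2, "isZoneRedundant": true }
    ]
  }
}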

Zone-Aware Application Design

Distribute Across Zones

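using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class ZoneEndpoint
{
    public string Zone { get; set; }
    public string Endpoint { get; set; }
}

public class ZoneAwareLoadBalancer
{
    // Reuse one HttpClient; creating one per call can exhaust sockets under load.
    private static readonly HttpClient Client = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };

    private readonly List<ZoneEndpoint> _endpoints;
    private int _currentIndex = -1;

    public ZoneAwareLoadBalancer()
    {
        // Illustrative per-zone endpoints; substitute the addresses of your own zonal deployments.
        _endpoints = new List<ZoneEndpoint>
        {
            new ZoneEndpoint { Zone = "1", Endpoint = "https://app-eastus-1.azurewebsites.net" },
            new ZoneEndpoint { Zone = "2", Endpoint = "https://app-eastus-2.azurewebsites.net" },
            new ZoneEndpoint { Zone = "3", Endpoint = "https://app-eastus-3.azurewebsites.net" }
        };
    }

    public async Task<string> CallServiceAsync()
    {
        // Rotate the starting zone so load is spread evenly, then fall back to the others in order.
        var start = (int)((uint)Interlocked.Increment(ref _currentIndex) % (uint)_endpoints.Count);
        var failures = new List<Exception>();

        for (var i = 0; i < _endpoints.Count; i++)
        {
            var endpoint = _endpoints[(start + i) % _endpoints.Count];
            try
            {
                return await CallEndpointAsync(endpoint);
            }
            catch (Exception ex) when (ex is HttpRequestException || ex is TaskCanceledException)
            {
                Console.WriteLine($"Zone {endpoint.Zone} failed: {ex.Message}");
                failures.Add(ex); // Continue to the next zone
            }
        }

        throw new AggregateException("All zones unavailable", failures);
    }

    private static async Task<string> CallEndpointAsync(ZoneEndpoint endpoint)
    {
        using var response = await Client.GetAsync($"{endpoint.Endpoint}/api/health");
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}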

Health Check for Zones

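using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;

public class ZoneHealth
{
    public string Zone { get; set; }
    public bool IsHealthy { get; set; }
}

public class ZoneHealthFunctions
{
    // Shared client with a short timeout so one unreachable zone cannot stall the check.
    private static readonly HttpClient Client = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };

    [FunctionName("ZoneHealthCheck")]
    public async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get")] HttpRequest req)
    {
        var zones = new[] { "1", "2", "3" };

        // Probe all zones concurrently so total latency is bounded by the slowest zone.
        var results = await Task.WhenAll(zones.Select(async zone =>
            new ZoneHealth { Zone = zone, IsHealthy = await CheckZoneHealthAsync(zone) }));

        // 200 means every zone responded; 503 flags that at least one zone is unhealthy.
        var status = results.All(r => r.IsHealthy) ? 200 : 503;
        return new JsonResult(results) { StatusCode = status };
    }

    private static async Task<bool> CheckZoneHealthAsync(string zone)
    {
        try
        {
            // Illustrative per-zone endpoint naming; substitute your own zonal addresses.
            var endpoint = $"https://app-eastus-{zone}.azurewebsites.net/api/health";
            using var response = await Client.GetAsync(endpoint);
            return response.IsSuccessStatusCode;
        }
        catch (Exception ex) when (ex is HttpRequestException || ex is TaskCanceledException)
        {
            return false;
        }
    }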
}
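
One caveat with the check above: it reports 503 as soon as any single zone is unhealthy. If a load balancer or traffic manager probes that endpoint, a single-zone failure could pull the whole application out of rotation. A possible alternative is a three-state summary, sketched here in shell for use in a probe or monitoring script. `overall_status` is a hypothetical helper and the state names are assumptions, not an Azure convention.

```shell
#!/usr/bin/env bash
# overall_status HEALTHY TOTAL
# Prints "healthy" when every zone responds, "degraded" when at least one
# zone still responds, and "down" when none do.
overall_status() {
  local healthy=$1 total=$2
  if (( healthy == total )); then
    echo healthy
  elif (( healthy > 0 )); then
    echo degraded
  else
    echo down
  fi
}

# Example: 2 of 3 zones answered their health probe.
overall_status 2 3
```

Under this scheme only `down` would map to a failing probe, while `degraded` would raise an alert without removing the healthy zones from service.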

Designing for Zone Failure

Failure Scenarios

┌─────────────────────────────────────────────────────────────────────┐
│                    ZONE FAILURE SCENARIOS                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Scenario 1: Single Zone Failure                                   │
│   ─────────────────────────────────                                 │
│   Zone 1 goes down → ~67% capacity remains                          │
│   ✓ Load balancer removes Zone 1 from rotation                      │
│   ✓ Traffic redistributed to Zone 2 & 3                             │
│   ✓ User experience: Minimal impact                                 │
│   ✓ Action: Auto-scale to compensate                                │
│                                                                     │
│   Scenario 2: Zone Isolation                                        │
│   ─────────────────────────────                                     │
│   Network issue in Zone 1 → Zone 1 unreachable                      │
│   ✓ Health checks detect failure                                    │
│   ✓ Traffic routes to healthy zones                                 │
│   ✓ Database multi-zone: continues working                          │
│   ✓ User experience: Brief latency increase                         │
│                                                                     │
│   Scenario 3: Cascading Failure                                     │
│   ────────────────────────────                                      │
│   Zone 1 fails → Traffic to Zone 2 & 3                              │
│   Zones 2 & 3 get overloaded → CPU spikes                           │
│   ✓ Auto-scale triggers                                             │
│   ✓ Queue buildup                                                   │
│   ✓ Circuit breaker prevents complete failure                       │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
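
Scenario 3 is largely a sizing problem: if each zone carries only a third of peak load, the two surviving zones start out short of capacity and must wait for auto-scale to catch up. A rough way to estimate the headroom needed is to size each zone so that the remaining zones can carry peak load alone. `instances_per_zone` is a hypothetical helper and the instance counts are illustrative.

```shell
#!/usr/bin/env bash
# instances_per_zone PEAK ZONES
# Smallest per-zone instance count such that the remaining zones still
# provide PEAK instances after one zone is lost: ceil(PEAK / (ZONES - 1)).
instances_per_zone() {
  local peak=$1 zones=$2
  if (( zones < 2 )); then
    echo "need at least 2 zones to survive a zone failure" >&2
    return 1
  fi
  local surviving=$(( zones - 1 ))
  echo $(( (peak + surviving - 1) / surviving ))
}

# Example: peak load needs 6 instances, deployed across 3 zones.
per_zone=$(instances_per_zone 6 3)
echo "per zone: ${per_zone}, total: $(( per_zone * 3 ))"
```

For three zones this works out to running about 150% of peak capacity in steady state, which is the price of absorbing a zone loss without waiting on auto-scale.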

Circuit Breaker Pattern

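using System;
using System.Threading.Tasks;

public class CircuitOpenException : Exception
{
    public CircuitOpenException() : base("Circuit is open; call rejected.") { }
}

public class CircuitBreaker
{
    private readonly object _lock = new object();
    private readonly int _threshold = 5;
    private readonly TimeSpan _openDuration = TimeSpan.FromMinutes(1);
    private int _failureCount = 0;
    private CircuitState _state = CircuitState.Closed;
    private DateTime _lastFailureTime;

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> operation)
    {
        lock (_lock)
        {
            if (_state == CircuitState.Open)
            {
                // Fail fast until the cool-down has elapsed, then allow a trial call.
                if (DateTime.UtcNow - _lastFailureTime <= _openDuration)
                {
                    throw new CircuitOpenException();
                }
                _state = CircuitState.HalfOpen;
            }
        }

        try
        {
            var result = await operation();
            lock (_lock)
            {
                // Any success closes the circuit and clears the failure history.
                _failureCount = 0;
                _state = CircuitState.Closed;
            }
            return result;
        }
        catch
        {
            lock (_lock)
            {
                _failureCount++;
                _lastFailureTime = DateTime.UtcNow;

                // Reopen after a failed trial call, or once the failure threshold is reached.
                if (_state == CircuitState.HalfOpen || _failureCount >= _threshold)
                {
                    _state = CircuitState.Open;
                }
            }
            throw;
        }
    }
}

public enum CircuitState { Closed, Open, HalfOpen }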

Cost Considerations

Zone-Redundant Costs

┌─────────────────────────────────────────────────────────────────────┐
│                 ZONE REDUNDANCY COST IMPACT                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Virtual Machines:                                                 │
│   ────────────────                                                  │
│   Single zone: 1 VM × $100 = $100/month                             │
│   Zone-redundant: 3 VMs × $100 = $300/month                         │
│   Increase: 200%                                                    │
│                                                                     │
│   Storage:                                                          │
│   ────────                                                          │
│   LRS (single zone): $0.02/GB                                       │
│   ZRS (3 zones): $0.03/GB                                           │
│   Increase: 50%                                                     │
│                                                                     │
│   Cosmos DB:                                                        │
│   ──────────                                                        │
│   Single region: $0.008/RU                                          │
│   Multi-region + zones: $0.012/RU                                   │
│   Increase: 50%                                                     │
│                                                                     │
│   ROI:                                                              │
│   ──────                                                            │
│   ✓ Near-zero RTO for critical workloads                            │
│   ✓ No data loss with synchronous zone replication                  │
│   ✓ Meets SLA requirements (99.99% for VMs in 2+ zones)             │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
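
The percentage increases in the box follow from a simple ratio, which can be reused with real prices from the Azure pricing calculator. The sketch below recomputes them; `pct_increase` is a hypothetical helper and the dollar figures are the illustrative ones from the box, not current list prices.

```shell
#!/usr/bin/env bash
# pct_increase BASE NEW
# Prints the percentage increase from BASE to NEW, rounded to a whole number.
pct_increase() {
  awk -v b="$1" -v n="$2" 'BEGIN { printf "%.0f\n", (n - b) / b * 100 }'
}

pct_increase 100 300      # VMs: 1 instance vs 3 instances  -> 200
pct_increase 0.02 0.03    # Storage: LRS vs ZRS per GB      -> 50
pct_increase 0.008 0.012  # Cosmos DB: per-RU figures       -> 50
```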

Best Practices

Implementation Checklist

Practice                │ Description
────────────────────────┼──────────────────────────────────────────
Use ZRS for storage     │ Minimal cost increase for high durability
Deploy across 3 zones   │ Maximum availability for single region
Implement health checks │ Detect zone failures quickly
Configure auto-scaling  │ Handle increased load after zone failure
Test failure scenarios  │ Regularly simulate zone outages
Monitor zone health     │ Create alerts for zone-specific issues

SLA Expectations

┌─────────────────────────────────────────────────────────────────────┐
│                    ZONE CONFIGURATION AND SLA                       │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Configuration           │  SLA       │  Annual Downtime           │
│   ────────────────────────┼────────────┼──────────────              │
│   Single VM               │  99.9%     │  8.76 hours                │
│   Availability Set        │  99.95%    │  4.38 hours                │
│   2 Zones (minimum)       │  99.99%    │  52.6 minutes              │
│   3 Zones (recommended)   │  99.99%+   │  < 52.6 minutes            │
│   Multi-region            │  99.999%+  │  < 5.3 minutes             │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
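
The downtime column is the SLA percentage converted into minutes over a 365-day year. A minimal sketch of that conversion follows; `downtime_minutes` is a hypothetical helper, and the SLA values passed in are the ones listed in the table.

```shell
#!/usr/bin/env bash
# downtime_minutes SLA_PERCENT
# Prints the downtime allowed per 365-day year, in minutes, for a given SLA.
downtime_minutes() {
  awk -v sla="$1" 'BEGIN { printf "%.2f\n", (100 - sla) / 100 * 365 * 24 * 60 }'
}

for sla in 99.9 99.95 99.99 99.999; do
  printf '%s%% -> %s minutes/year\n' "$sla" "$(downtime_minutes "$sla")"
done
```

For example, 99.9% allows 525.60 minutes (8.76 hours) per year, while 99.99% allows 52.56 minutes.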
