Geo-DR for Service Bus & Storage

Disaster Recovery for Messaging and Data Infrastructure


Introduction

Geo-disaster recovery (Geo-DR) keeps integration workloads running when a regional outage occurs. For workloads that rely on Azure Service Bus and Azure Storage, the DR design determines how much data and downtime an outage costs. Service Bus Premium namespaces can be geo-paired across regions (the pairing replicates entity metadata, not messages), while Storage accounts offer built-in redundancy options that must be chosen to match the recovery targets.

This guide covers:

  • Service Bus geo-replication — Geo-pairing and failover
  • Storage redundancy options — LRS, ZRS, GRS, GZRS
  • Data synchronization strategies — Keeping queues/topics in sync
  • Failover orchestration — Automated vs manual DR
  • Testing DR capabilities — Regular DR drills

Azure Service Bus Geo-DR

Geo-Pairing Architecture

┌─────────────────────────────────────────────────────────────────────┐
│              SERVICE BUS GEO-DR ARCHITECTURE                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   PRIMARY REGION (East US)          SECONDARY REGION (West US 2)    │
│   ═══════════════════════════       ═══════════════════════════     │
│                                                                     │
│   ┌─────────────────────┐              ┌─────────────────────┐      │
│   │   Service Bus       │───Metadata──▶│   Service Bus       │      │
│   │   Namespace         │  replication │   Namespace         │      │
│   │                     │              │                     │      │
│   │  ┌───────────────┐  │              │  ┌───────────────┐  │      │
│   │  │ orders-topic  │  │              │  │ orders-topic  │  │      │
│   │  └───────┬───────┘  │              │  └───────┬───────┘  │      │
│   │          │          │              │          │          │      │
│   │  ┌───────┴───────┐  │              │  ┌───────┴───────┐  │      │
│   │  │ orders-queue  │  │              │  │ orders-queue  │  │      │
│   │  └───────────────┘  │              │  └───────────────┘  │      │
│   └─────────┬───────────┘              └─────────┬───────────┘      │
│             │                                    │                  │
│             ▼                                    ▼                  │
│   ┌──────────────────────────────────────────────────────┐          │
│   │              ALIAS (DNS-Safe Name)                   │          │
│   │   mydr-alias.servicebus.windows.net (alias)          │          │
│   └──────────────────────────────────────────────────────┘          │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Geo-Pair Configuration

# Create primary namespace (must be Premium tier)
az servicebus namespace create \
  --name mynamespace-east \
  --resource-group my-rg \
  --location eastus \
  --sku Premium \
  --capacity 1

# Create secondary namespace (Premium, and empty: no queues or topics yet)
az servicebus namespace create \
  --name mynamespace-west \
  --resource-group my-rg \
  --location westus2 \
  --sku Premium \
  --capacity 1

# Create geo-pairing (alias), run against the primary namespace
az servicebus georecovery-alias set \
  --resource-group my-rg \
  --namespace-name mynamespace-east \
  --alias mydr-alias \
  --partner-namespace mynamespace-west

# Check pairing status (provisioningState should reach "Succeeded")
az servicebus georecovery-alias show \
  --resource-group my-rg \
  --namespace-name mynamespace-east \
  --alias mydr-alias
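
The pairing status can also be evaluated in a script rather than read by eye. The following is a minimal Python sketch, assuming the output of the show command has been captured as JSON. The function name and the sample payload are illustrative; `provisioningState`, `role`, and `pendingReplicationOperationsCount` are fields of the alias resource.

```python
import json

def pairing_is_healthy(alias_json: str) -> bool:
    """Return True when the alias reports a provisioned, caught-up pairing."""
    alias = json.loads(alias_json)
    pending = alias.get("pendingReplicationOperationsCount") or 0
    return (
        alias.get("provisioningState") == "Succeeded"
        and alias.get("role") == "Primary"
        and pending == 0
    )

# Hypothetical, trimmed output of the show command run against the primary
sample = json.dumps({
    "name": "mydr-alias",
    "partnerNamespace": "mynamespace-west",
    "provisioningState": "Succeeded",
    "role": "Primary",
    "pendingReplicationOperationsCount": 0,
})
print(pairing_is_healthy(sample))
```

A check like this could gate a deployment pipeline so that new entities are only created once the pairing has finished provisioning.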

Failover Configuration

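using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

public class ServiceBusGeoDR
{
    private const string Scope =
        "--resource-group my-rg --namespace-name mynamespace-west --alias mydr-alias";
    private readonly ServiceBusClient _primaryClient;
    private readonly ServiceBusClient _secondaryClient;

    public ServiceBusGeoDR(ServiceBusClient primaryClient, ServiceBusClient secondaryClient)
    {
        _primaryClient = primaryClient;
        _secondaryClient = secondaryClient;
    }

    // Clients that connect through the alias follow a failover automatically;
    // this switch is only for clients bound to a specific namespace.
    public async Task<ServiceBusClient> GetActiveClientAsync()
    {
        var secondaryRole = await GetSecondaryRoleAsync();

        // While paired, the secondary reports "Secondary"; after failover it is promoted
        return secondaryRole == "Secondary"
            ? _primaryClient
            : _secondaryClient;
    }

    public async Task ForceFailoverAsync()
    {
        // Initiate geo-failover (repoints the alias to the secondary and breaks
        // the pairing). The command must target the secondary namespace.
        await RunAzAsync($"servicebus georecovery-alias fail-over {Scope}");
    }

    // Reads the role the secondary namespace currently holds in the pairing
    private Task<string> GetSecondaryRoleAsync() =>
        RunAzAsync($"servicebus georecovery-alias show {Scope} --query role --output tsv");

    // Runs the Azure CLI and returns trimmed stdout (use "az.cmd" on Windows)
    private static async Task<string> RunAzAsync(string arguments)
    {
        var startInfo = new ProcessStartInfo("az", arguments)
        {
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        using var process = Process.Start(startInfo)
            ?? throw new InvalidOperationException("Could not start the Azure CLI");
        var output = await process.StandardOutput.ReadToEndAsync();
        await process.WaitForExitAsync();
        if (process.ExitCode != 0)
            throw new InvalidOperationException($"az exited with code {process.ExitCode}");
        return output.Trim();
    }
}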

Failover Behavior

┌─────────────────────────────────────────────────────────────────────┐
│                    GEO-DR FAILOVER BEHAVIOR                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   What is replicated:                                               │
│   ✓ Queue definitions and properties                                │
│   ✓ Topic definitions                                               │
│   ✓ Subscription rules                                              │
│   ✓ Metadata (not messages in queue)                                │
│                                                                     │
│   What is NOT replicated:                                           │
│   ✗ Active messages in queues                                       │
│   ✗ Active messages in topics                                       │
│   ✗ Peek-lock messages                                              │
│                                                                     │
│   ⚠️ IMPORTANT: Messages must be drained or handled                 │
│   separately before failover for critical workloads                 │
│                                                                     │
│   RTO: ~60 seconds (DNS TTL + propagation)                          │
│   RPO: Metadata only (application-level for data)                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
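
Because messages are not replicated, a planned failover is safer when it first confirms which entities still hold messages. The following is a minimal Python sketch of that pre-failover check. The function name and the counts are hypothetical; in practice the counts would come from the runtime properties of each queue or subscription.

```python
def queues_at_risk(active_counts: dict[str, int]) -> list[str]:
    """Names of entities that still hold messages a failover would leave behind."""
    return sorted(name for name, count in active_counts.items() if count > 0)

# Hypothetical active-message counts gathered before a planned failover
counts = {"orders-queue": 12, "audit-queue": 0, "billing-queue": 3}
print(queues_at_risk(counts))
```

An empty result suggests the primary has been drained; a non-empty one lists the entities to drain or replicate at the application level first.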

Azure Storage Redundancy Options

Redundancy Comparison

┌─────────────────────────────────────────────────────────────────────┐
│               STORAGE REDUNDANCY OPTIONS                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   LRS (Locally Redundant Storage)                                   │
│   ─────────────────────────────────                                 │
│   - 3 copies in single data center                                  │
│   - Lowest cost                                                     │
│   - Protects against hardware failure                               │
│   - Use: Non-critical, easily recreatable data                      │
│                                                                     │
│   ZRS (Zone-Redundant Storage)                                      │
│   ────────────────────────────                                      │
│   - 3 copies across 3 availability zones                            │
│   - Protects against zone failure                                   │
│   - ~20% premium over LRS                                           │
│   - Use: Moderate critical, zone resilience needed                  │
│                                                                     │
│   GRS (Geo-Redundant Storage)                                       │
│   ───────────────────────────                                       │
│   - LRS in primary + LRS in secondary region                        │
│   - Roughly 2x LRS price (varies by region and tier)                │
│   - Customer-managed (manual) failover                              │
│   - Use: DR required, RPO typically < 15 min (no SLA)               │
│                                                                     │
│   GZRS (Geo-Zone-Redundant Storage)                                 │
│   ──────────────────────────────                                    │
│   - ZRS in primary + LRS in secondary                               │
│   - Best balance of protection and cost                             │
│   - Costs more than GRS (varies by region and tier)                 │
│   - Use: Highest protection, zone + region                          │
│                                                                     │
│   RA-GRS (Read-Access Geo-Redundant)                                │
│   ──────────────────────────────────────                            │
│   - Same as GRS + read access to secondary                          │
│   - Secondary endpoint for read-only operations                     │
│   - Use: Read-heavy workloads, lower primary load                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
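
The comparison above reduces to three questions: is zone resilience needed, is region resilience needed, and must the secondary be readable. The following Python sketch maps those answers to a storage SKU name. The function and its parameters are illustrative; the returned strings are the SKU names accepted by the `--sku` option.

```python
def choose_sku(zone_resilient: bool, region_resilient: bool,
               read_secondary: bool = False) -> str:
    """Pick a redundancy SKU from resilience requirements."""
    if read_secondary and not region_resilient:
        raise ValueError("Read access to a secondary requires geo-redundancy")
    if region_resilient:
        base = "GZRS" if zone_resilient else "GRS"
        prefix = "RA" if read_secondary else ""
        return f"Standard_{prefix}{base}"
    return "Standard_ZRS" if zone_resilient else "Standard_LRS"

print(choose_sku(zone_resilient=True, region_resilient=True))
print(choose_sku(zone_resilient=False, region_resilient=True, read_secondary=True))
```

Cost is deliberately left out of this sketch, since prices vary by region and access tier.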

Storage Account Configuration

# Create storage with GZRS (recommended for DR)
az storage account create \
  --name mystorageaccount \
  --resource-group my-rg \
  --location eastus \
  --sku Standard_GZRS \
  --enable-hierarchical-namespace true

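# Convert an existing account to GZRS
# (a direct SKU update works from ZRS; LRS and GRS accounts first need a
# zone-redundancy conversion, which can take considerably longer)
az storage account update \
  --name mystorageaccount \
  --resource-group my-rg \
  --sku Standard_GZRS

# Check geo-replication status (status, lastSyncTime, canFailover)
az storage account show \
  --name mystorageaccount \
  --resource-group my-rg \
  --expand geoReplicationStats \
  --query "geoReplicationStats"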
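
Geo-redundant accounts report a last sync time (the `lastSyncTime` field under `geoReplicationStats` when the account is queried with `--expand geoReplicationStats`). Writes made after that time may not have reached the secondary, so the gap between it and the present is a rough estimate of the data at risk. The following Python sketch computes that gap; the function name and timestamps are illustrative.

```python
from datetime import datetime, timezone

def replication_lag_seconds(last_sync_time: str, now: datetime) -> float:
    """Seconds between the last geo-sync and `now`."""
    # Accept both "...Z" and "...+00:00" style ISO 8601 timestamps
    synced = datetime.fromisoformat(last_sync_time.replace("Z", "+00:00"))
    return (now - synced).total_seconds()

# Hypothetical values: last sync ten minutes before the check
now = datetime(2024, 1, 1, 12, 10, 0, tzinfo=timezone.utc)
lag = replication_lag_seconds("2024-01-01T12:00:00Z", now)
print(lag)  # 600.0
```

Comparing this value against the RPO target before initiating a failover makes the expected data loss explicit.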

Storage Failover

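using System;
using System.Diagnostics;
using System.Threading.Tasks;

public class StorageFailoverManager
{
    public async Task InitiateFailoverAsync(string storageAccountName)
    {
        // Check if account supports failover (reported for geo-redundant SKUs)
        var canFailover = await RunAzAsync(
            $"storage account show --name {storageAccountName} --resource-group my-rg " +
            "--expand geoReplicationStats --query geoReplicationStats.canFailover --output tsv");

        if (!string.Equals(canFailover, "true", StringComparison.OrdinalIgnoreCase))
            throw new InvalidOperationException(
                "Account does not support geo-failover");

        // Initiate failover; writes made after the last sync time can be lost
        await RunAzAsync(
            $"storage account failover --name {storageAccountName} --resource-group my-rg --yes");

        // Endpoint URIs are unchanged after failover (DNS is repointed), so
        // connection strings stay valid. The account is LRS in the new primary
        // region until geo-redundancy is re-enabled.
    }

    public async Task<bool> IsFailoverNeededAsync(string storageAccountName)
    {
        // statusOfPrimary reports "available" or "unavailable"
        var status = await RunAzAsync(
            $"storage account show --name {storageAccountName} --resource-group my-rg " +
            "--query statusOfPrimary --output tsv");

        return string.Equals(status, "unavailable", StringComparison.OrdinalIgnoreCase);
    }

    // Runs the Azure CLI and returns trimmed stdout (use "az.cmd" on Windows)
    private static async Task<string> RunAzAsync(string arguments)
    {
        var startInfo = new ProcessStartInfo("az", arguments)
        {
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        using var process = Process.Start(startInfo)
            ?? throw new InvalidOperationException("Could not start the Azure CLI");
        var output = await process.StandardOutput.ReadToEndAsync();
        await process.WaitForExitAsync();
        if (process.ExitCode != 0)
            throw new InvalidOperationException($"az exited with code {process.ExitCode}");
        return output.Trim();
    }
}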
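
A failover triggered by a single failed health check risks reacting to a transient blip, and a storage failover is costly to undo. One common safeguard is to require several consecutive failures first. The following Python sketch shows that gate; the class name and the threshold of three are assumptions, not Azure defaults.

```python
class FailoverGate:
    """Signals failover only after `threshold` consecutive failed probes."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def record(self, primary_healthy: bool) -> bool:
        """Record one probe result; return True when failover should start."""
        # Any healthy probe resets the streak
        self.failures = 0 if primary_healthy else self.failures + 1
        return self.failures >= self.threshold

gate = FailoverGate(threshold=3)
probes = (False, False, True, False, False, False)
results = [gate.record(healthy) for healthy in probes]
print(results)  # [False, False, False, False, False, True]
```

In a manual-failover setup the same gate can raise an alert for an operator instead of starting the failover itself.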

Data Synchronization Strategies

Queue Message Replication

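using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

public class CrossRegionQueueSynchronizer
{
    private readonly ServiceBusClient _primaryClient;
    private readonly ServiceBusClient _secondaryClient;

    public CrossRegionQueueSynchronizer(
        ServiceBusClient primaryClient, ServiceBusClient secondaryClient)
    {
        _primaryClient = primaryClient;
        _secondaryClient = secondaryClient;
    }

    // Moves messages from the primary queue to the secondary queue until the
    // primary is empty. Completing a message removes it from the primary.
    public async Task ReplicateMessagesAsync(string queueName)
    {
        await using var receiver = _primaryClient.CreateReceiver(queueName);
        await using var sender = _secondaryClient.CreateSender(queueName);

        while (true)
        {
            var messages = await receiver.ReceiveMessagesAsync(
                maxMessages: 100,
                maxWaitTime: TimeSpan.FromSeconds(5));

            if (messages.Count == 0) break;

            foreach (var message in messages)
            {
                // Copy body and properties, then tag the copy with its origin.
                // ApplicationProperties is get-only, so entries are added to it.
                var replicatedMessage = new ServiceBusMessage(message);
                replicatedMessage.ApplicationProperties["ReplicatedFrom"] = "primary";
                replicatedMessage.ApplicationProperties["OriginalMessageId"] = message.MessageId;
                replicatedMessage.ApplicationProperties["ReplicationTime"] = DateTime.UtcNow;

                // Send first, complete second: a crash in between produces a
                // duplicate on the secondary rather than a lost message
                await sender.SendMessageAsync(replicatedMessage);
                await receiver.CompleteMessageAsync(message);
            }
        }
    }
}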
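
Application-level replication can deliver the same message more than once, for example when a copy is sent but the original is not completed before a crash. Consumers on the secondary side therefore benefit from deduplicating on the original message ID. The following Python sketch keeps a bounded window of recently seen IDs; the class name and window size are assumptions.

```python
from collections import OrderedDict

class Deduplicator:
    """Remembers the most recent message IDs and rejects repeats."""

    def __init__(self, max_ids: int = 10_000):
        self._seen = OrderedDict()
        self._max_ids = max_ids

    def is_new(self, original_message_id: str) -> bool:
        if original_message_id in self._seen:
            return False
        self._seen[original_message_id] = True
        if len(self._seen) > self._max_ids:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True

dedup = Deduplicator(max_ids=2)
print([dedup.is_new(i) for i in ("a", "a", "b")])
```

The window is in memory only, so a restarted consumer forgets earlier IDs; durable deduplication would need a shared store or the duplicate detection feature of the queue itself.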

Blob Replication with AzCopy

# Enable change feed and versioning on the source account
# (object replication needs change feed on the source and versioning on both accounts)
az storage account blob-service-properties update \
  --account-name mystorageeast \
  --resource-group my-rg \
  --enable-change-feed true \
  --enable-versioning true

# Use AzCopy for bulk replication (each URL needs its own SAS token)
azcopy copy \
  "https://mystorageeast.blob.core.windows.net/container?<SasToken>" \
  "https://mystoragewest.blob.core.windows.net/container?<SasToken>" \
  --recursive

# Schedule replication via Azure Functions
# Use Event Grid blob events to trigger replication
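
For the event-driven option, the handler has to turn each blob-created event into a copy destination. The following Python sketch shows only that mapping step, assuming an Event Grid event whose `data.url` holds the source blob URL. The function name and account names are illustrative, and the copy itself is left out.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

def replication_target(event: dict, dest_account: str) -> Optional[str]:
    """Return the destination blob URL for a BlobCreated event, else None."""
    if event.get("eventType") != "Microsoft.Storage.BlobCreated":
        return None
    source = urlsplit(event["data"]["url"])
    # Swap the account name, keep the service suffix and the blob path
    _, suffix = source.netloc.split(".", 1)
    return urlunsplit((source.scheme, f"{dest_account}.{suffix}", source.path, "", ""))

event = {
    "eventType": "Microsoft.Storage.BlobCreated",
    "data": {"url": "https://mystorageeast.blob.core.windows.net/container/orders/1.json"},
}
print(replication_target(event, "mystoragewest"))
```

Events of other types return `None`, so deletes are not mirrored by this sketch.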

DR Testing and Validation

Test Runbook

# 1. Verify geo-pair status (expect provisioningState "Succeeded")
az servicebus georecovery-alias show \
  --resource-group my-rg \
  --namespace-name mynamespace-east \
  --alias mydr-alias

# 2. Check storage replication status
az storage account show \
  --name mystorageaccount \
  --resource-group my-rg \
  --expand geoReplicationStats \
  --query "{secondary: secondaryLocation, status: geoReplicationStats.status, lastSync: geoReplicationStats.lastSyncTime}"

# 3. Confirm the secondary namespace resolves in DNS
nslookup mynamespace-west.servicebus.windows.net

# 4. Verify DNS resolution for alias
nslookup mydr-alias.servicebus.windows.net

DR Drill Checklist

┌─────────────────────────────────────────────────────────────────────┐
│                    DR DRILL CHECKLIST                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   Pre-Drill:                                                        │
│   □ Notify stakeholders of test window                              │
│   □ Document current configuration                                  │
│   □ Verify backup of critical data                                  │
│   □ Prepare rollback plan                                           │
│                                                                     │
│   During Drill:                                                     │
│   □ Simulate region failure                                         │
│   □ Verify failover triggers correctly                              │
│   □ Confirm alias points to secondary                               │
│   □ Test application connectivity                                   │
│   □ Validate message routing                                        │
│   □ Check storage accessibility                                     │
│                                                                     │
│   Post-Drill:                                                       │
│   □ Failback to primary region                                      │
│   □ Verify data consistency                                         │
│   □ Document lessons learned                                        │
│   □ Update runbooks                                                 │
│   □ Share results with stakeholders                                 │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
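
A drill is easier to compare across quarters when it records the recovery time actually achieved. The following Python sketch computes it from two timestamps noted during the drill; the function name, the field names, and the 60-second target are illustrative.

```python
from datetime import datetime, timedelta

def drill_result(failover_started: datetime, service_restored: datetime,
                 rto_target: timedelta) -> dict:
    """Summarize achieved recovery time against the target."""
    achieved = service_restored - failover_started
    return {
        "achieved_seconds": achieved.total_seconds(),
        "met_target": achieved <= rto_target,
    }

# Hypothetical drill: service restored 45 seconds after failover began
result = drill_result(
    datetime(2024, 1, 1, 12, 0, 0),
    datetime(2024, 1, 1, 12, 0, 45),
    rto_target=timedelta(seconds=60),
)
print(result)
```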

Best Practices

Implementation Checklist

Practice                        Description
──────────────────────────────  ──────────────────────────────────────────────────
Premium tier for SB             Geo-DR requires a Service Bus Premium namespace
GZRS for storage                Use GZRS to protect against zone and region failures
Manual failover                 Control when failover occurs for data consistency
Application-level replication   Replicate critical messages for lower RPO
Regular testing                 Quarterly DR drills minimum
Runbook documentation           Document exact steps for each failure scenario

Architecture Decision

{
  "geo-dr-architecture": {
    "serviceBus": {
      "tier": "Premium",
      "geoPairing": true,
      "replication": "Metadata-only",
      "rto": "60 seconds",
      "rpo": "Application dependent"
    },
    "storage": {
      "redundancy": "GZRS",
      "failover": "Manual",
      "replication": "Built-in",
      "rto": "Typically under 1 hour",
      "rpo": "Typically under 15 minutes (no SLA)"
    },
    "application": {
      "messageReplication": "Critical queues only",
      "connectionStringUpdate": "Use alias, not direct endpoint"
    }
  }
}
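
A decision record like this can be checked automatically so that it does not drift from the constraints described earlier. The following Python sketch validates two of them: geo-pairing requires the Premium tier, and the storage redundancy must be geo-redundant. The function name and the rule set are assumptions; the keys mirror the JSON above.

```python
import json

GEO_REDUNDANT = {"GRS", "RA-GRS", "GZRS", "RA-GZRS"}

def validate(decision: dict) -> list[str]:
    """Return a list of problems found in the decision record."""
    problems = []
    arch = decision["geo-dr-architecture"]
    service_bus = arch["serviceBus"]
    if service_bus.get("geoPairing") and service_bus.get("tier") != "Premium":
        problems.append("Geo-pairing requires the Premium tier")
    if arch["storage"].get("redundancy") not in GEO_REDUNDANT:
        problems.append("Storage redundancy is not geo-redundant")
    return problems

record = json.loads("""
{"geo-dr-architecture": {
   "serviceBus": {"tier": "Premium", "geoPairing": true},
   "storage": {"redundancy": "GZRS", "failover": "Manual"}}}
""")
print(validate(record))
```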
