Why Consumer Groups and Partitions Matter
Azure Event Hubs is a streaming platform, not a message queue. Understanding this distinction is the foundation for everything else in this article.
The Streaming Model vs. Traditional Queues
In a traditional queue (like Azure Service Bus queues), a message is consumed once and then removed. If three services need the same message, you need a topic with three subscriptions. The broker tracks delivery state, handles acknowledgments, and removes messages after processing.
Event Hubs works differently. Events are appended to a log and retained for a configurable period (up to 1 day on Basic, 7 days on Standard, and 90 days on Premium and Dedicated). Consumers read from the log at their own pace using an offset, a pointer to their position in the stream. The broker does not track who has read what; consumers manage their own position.
This means:
- Multiple readers can consume the same data independently without interfering with each other.
- Events are not deleted after reading — they expire based on retention policy.
- Consumers can rewind to reprocess historical data at any time, as long as the data is still within the retention period.
- Throughput scales horizontally because the log is split into partitions.
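The log model above can be sketched with a small in-memory example. This illustrates the idea only and is not how Event Hubs is implemented; `EventLog` and `LogReader` are hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;

// Illustrative in-memory log: events are appended and never removed by reads.
public sealed class EventLog
{
    private readonly List<string> _events = new();

    // Append an event and return the offset (index) it was stored at.
    public long Append(string body)
    {
        _events.Add(body);
        return _events.Count - 1;
    }

    // Return every event at or after the given offset without deleting anything.
    public IReadOnlyList<string> ReadFrom(long offset)
    {
        if (offset < 0) throw new ArgumentOutOfRangeException(nameof(offset));
        var start = (int)Math.Min(offset, _events.Count);
        return _events.GetRange(start, _events.Count - start);
    }
}

// Each reader owns its position; the log knows nothing about its readers.
public sealed class LogReader
{
    private readonly EventLog _log;

    public long Offset { get; private set; }

    public LogReader(EventLog log) => _log = log;

    // Read everything new and advance this reader's own offset.
    public IReadOnlyList<string> Poll()
    {
        var batch = _log.ReadFrom(Offset);
        Offset += batch.Count;
        return batch;
    }

    // Move to a chosen offset, for example to reprocess older events.
    public void Seek(long offset) => Offset = offset;
}
```

Two `LogReader` instances over the same `EventLog` each receive every event, and calling `Seek(0)` on one of them replays the stream for that reader without affecting the other.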
Why Partitions Exist
A single log would become a bottleneck. Partitions solve this by splitting the event stream into multiple parallel logs:
- Parallelism — Multiple consumers can read different partitions simultaneously.
- Throughput — Each partition supports up to 1 MB/s ingress and 2 MB/s egress (Standard tier). More partitions = more aggregate throughput.
- Ordering guarantee — Events within a single partition are strictly ordered. Events across partitions have no ordering guarantee.
When a producer sends an event with a partition key (e.g., a device ID or tenant ID), Event Hubs hashes the key and routes all events with that key to the same partition. This guarantees ordering for related events without requiring a single global order.
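Key-based routing can be sketched as "hash the key, then take it modulo the partition count". The hash below is FNV-1a, chosen only because it is short and deterministic. Event Hubs uses its own internal hash, so the partition numbers this sketch produces will not match the service; `PartitionRouter` is a hypothetical name.

```csharp
using System;
using System.Text;

public static class PartitionRouter
{
    // 32-bit FNV-1a over the UTF-8 bytes of the key.
    // Assumption: a stand-in hash, not the one Event Hubs uses.
    public static uint Fnv1a(string key)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;
        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(key))
        {
            hash ^= b;
            hash = unchecked(hash * prime);
        }
        return hash;
    }

    // The same key always maps to the same partition for a fixed partition count.
    public static int PartitionFor(string partitionKey, int partitionCount)
    {
        if (partitionCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(partitionCount));
        return (int)(Fnv1a(partitionKey) % (uint)partitionCount);
    }
}
```

Calling `PartitionRouter.PartitionFor("d1", 4)` repeatedly returns the same partition index, which is what keeps all events for one device in order.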
Why Consumer Groups Exist
A consumer group represents a logical subscriber to the event hub. Each consumer group maintains its own independent read position across all partitions.
Without consumer groups, you'd have a single reader position. If your real-time alerting system and your analytics pipeline both need the same events, they'd fight over the offset. Consumer groups solve this — each gets its own view of the stream.
Think of it like a DVR recording: multiple people can watch the same show at different points, fast-forward, or rewind independently.
Architecture Overview
┌───────────────────────────────────────────────────────────────────────────┐
│ PRODUCERS │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ IoT Dev │ │ Web App │ │ Service A│ │ Service B│ │
│ │ key:"d1" │ │ key:"u5" │ │ key:"d2" │ │ key:"u5" │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │
└────────┼───────────────┼───────────────┼───────────────┼──────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ EVENT HUB: "telemetry-hub" │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Partition 0 │ │ Partition 1 │ │ Partition 2 │ │ Partition 3 │ │
│ │ │ │ │ │ │ │ │ │
│ │ [e1][e4][e7]│ │ [e2][e5] │ │ [e3][e6][e8]│ │ [e9][e10] │ │
│ │ offset: 3 │ │ offset: 2 │ │ offset: 3 │ │ offset: 2 │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │ │
└──────────┼───────────────┼───────────────┼───────────────┼───────────────┘
│ │ │ │
┌─────┴───────────────┴───────────────┴───────────────┴─────┐
│ │
▼ ▼
┌──────────────────────────────────────┐ ┌──────────────────────────────────┐
│ CONSUMER GROUP A: "realtime-alerts" │ │ CONSUMER GROUP B: "analytics" │
│ (3 instances) │ │ (2 instances) │
│ │ │ │
│ ┌──────────┐ Owns: P0, P1 │ │ ┌──────────┐ Owns: P0, P1 │
│ │Instance 1│ Offset P0: 2 │ │ │Instance 1│ Offset P0: 1 │
│ │ │ Offset P1: 2 │ │ │ │ Offset P1: 2 │
│ └──────────┘ │ │ └──────────┘ │
│ │ │ │
│ ┌──────────┐ Owns: P2 │ │ ┌──────────┐ Owns: P2, P3 │
│ │Instance 2│ Offset P2: 1 │ │ │Instance 2│ Offset P2: 3 │
│ │ │ │ │ │ │ Offset P3: 2 │
│ └──────────┘ │ │ └──────────┘ │
│ │ │ │
│ ┌──────────┐ Owns: P3 │ └──────────────────────────────────┘
│ │Instance 3│ Offset P3: 0 │
│ │ │ (behind — lag!) │
│ └──────────┘ │
│ │
└──────────────────────────────────────┘
Key observations:
• Each consumer group tracks offsets independently
• Consumer Group A is at different offsets than Consumer Group B
• Within a group, each partition is owned by exactly ONE instance
• Instance 3 in Group A has lag (offset 0 vs latest 2) — it's behind
• Group B's Instance 2 is fully caught up on P2 (offset 3 = latest)
How Consumer Groups Work
Independent Read Positions
Each consumer group maintains its own checkpoint (offset) for every partition. This means:
- Consumer Group A can be processing events from 5 minutes ago (real-time alerting).
- Consumer Group B can be processing events from 2 hours ago (batch analytics catching up).
- Consumer Group C can rewind to yesterday to reprocess after a bug fix.
None of these interfere with each other. The events remain in the partition until the retention period expires, regardless of how many consumer groups have read them.
The $Default Consumer Group
Every Event Hub comes with a built-in consumer group called $Default. It's fine for development and simple scenarios, but in production you should create dedicated consumer groups for each logical subscriber. This provides:
- Isolation — One slow consumer doesn't affect another group's throughput.
- Independent checkpointing — Each group tracks its own progress.
- Clarity — Named groups make monitoring and debugging easier.
Partition Distribution Within a Consumer Group
Within a single consumer group, partitions are distributed among consumer instances:
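- One partition → one consumer (at any given time). EventProcessorClient assigns each partition to a single owner within a consumer group. The service itself permits up to five concurrent readers per partition per consumer group, but each of them would receive every event, so duplicate readers are normally avoided.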
- One consumer → one or more partitions. A single consumer can own multiple partitions.
- Max useful consumers = number of partitions. If you have 4 partitions and 6 consumers, 2 consumers will be idle.
This is the fundamental scaling constraint: the number of partitions sets the maximum parallelism within a consumer group.
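The constraint can be made concrete with a small sketch that spreads partitions across consumers round-robin. This is a simplification for illustration, not the algorithm EventProcessorClient runs; `PartitionAssignment` is a hypothetical name and consumer names are assumed to be unique.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartitionAssignment
{
    // Spread partition IDs 0..partitionCount-1 across consumers round-robin.
    // Consumers beyond the partition count end up with an empty list (idle).
    public static Dictionary<string, List<int>> Distribute(
        int partitionCount, IReadOnlyList<string> consumers)
    {
        var result = consumers.ToDictionary(c => c, _ => new List<int>());
        if (consumers.Count == 0) return result;

        for (int p = 0; p < partitionCount; p++)
        {
            result[consumers[p % consumers.Count]].Add(p);
        }
        return result;
    }

    // How many consumers have nothing to read.
    public static int IdleConsumers(int partitionCount, int consumerCount) =>
        Math.Max(0, consumerCount - partitionCount);
}
```

With 4 partitions and 6 consumers, `IdleConsumers(4, 6)` returns 2, matching the example above.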
Consumer Group Limits
| Tier | Max Consumer Groups per Event Hub |
|---|---|
| Basic | 1 ($Default only) |
| Standard | 20 |
| Premium | 100 |
| Dedicated | 1000 |
Step-by-Step Implementation
Creating Consumer Groups
Using Azure CLI:
# Create a consumer group
az eventhubs eventhub consumer-group create \
--resource-group myResourceGroup \
--namespace-name myNamespace \
--eventhub-name telemetry-hub \
--name realtime-alerts
# Create another consumer group for analytics
az eventhubs eventhub consumer-group create \
--resource-group myResourceGroup \
--namespace-name myNamespace \
--eventhub-name telemetry-hub \
--name analytics-pipeline
# List consumer groups
az eventhubs eventhub consumer-group list \
--resource-group myResourceGroup \
--namespace-name myNamespace \
--eventhub-name telemetry-hub \
--output table
EventProcessorClient Setup (C#)
The EventProcessorClient is the recommended way to consume events. It handles partition balancing, checkpointing, and failover automatically.
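using System.Text.Json;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Consumer;
using Azure.Messaging.EventHubs.Processor;
using Azure.Storage.Blobs;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;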
public class EventHubConsumerService : IHostedService
{
private readonly EventProcessorClient _processor;
private readonly ILogger<EventHubConsumerService> _logger;
public EventHubConsumerService(IConfiguration config, ILogger<EventHubConsumerService> logger)
{
_logger = logger;
// Blob container stores checkpoint data (offsets per partition)
var blobClient = new BlobContainerClient(
config["Storage:ConnectionString"],
"eventhub-checkpoints");
_processor = new EventProcessorClient(
blobClient,
"realtime-alerts", // consumer group name
config["EventHub:ConnectionString"],
"telemetry-hub"); // event hub name
// Register handlers
_processor.ProcessEventAsync += ProcessEventAsync;
_processor.ProcessErrorAsync += ProcessErrorAsync;
_processor.PartitionInitializingAsync += PartitionInitializingAsync;
_processor.PartitionClosingAsync += PartitionClosingAsync;
}
public async Task StartAsync(CancellationToken cancellationToken)
{
await _processor.StartProcessingAsync(cancellationToken);
_logger.LogInformation("Event processor started");
}
public async Task StopAsync(CancellationToken cancellationToken)
{
await _processor.StopProcessingAsync(cancellationToken);
_logger.LogInformation("Event processor stopped");
}
}
Processing Events with Checkpointing
private async Task ProcessEventAsync(ProcessEventArgs args)
{
if (args.CancellationToken.IsCancellationRequested)
return;
try
{
// Deserialize the event
var eventBody = args.Data.EventBody.ToString();
var telemetry = JsonSerializer.Deserialize<TelemetryEvent>(eventBody);
// Process the event (your business logic)
await ProcessTelemetryAsync(telemetry);
// Checkpoint every 10 events to reduce storage calls
if (args.Data.SequenceNumber % 10 == 0)
{
await args.UpdateCheckpointAsync();
_logger.LogDebug(
"Checkpointed partition {PartitionId} at sequence {Sequence}",
args.Partition.PartitionId,
args.Data.SequenceNumber);
}
}
catch (Exception ex)
{
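// Swallowing the exception means the processor moves on to the next event.
// A later checkpoint will then skip this one, so retry here or route the
// event to a dead-letter store (not shown) if it must not be lost.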
_logger.LogError(ex,
"Error processing event from partition {PartitionId}, sequence {Sequence}",
args.Partition.PartitionId,
args.Data.SequenceNumber);
}
}
Handling Partition Lifecycle Events
private Task PartitionInitializingAsync(PartitionInitializingEventArgs args)
{
_logger.LogInformation(
"Partition {PartitionId} initializing. Default position: {Position}",
args.PartitionId,
args.DefaultStartingPosition);
// DefaultStartingPosition is only used when the partition has no checkpoint.
// Use EventPosition.Earliest to process all retained events
// Use EventPosition.Latest to only process new events
args.DefaultStartingPosition = EventPosition.Latest;
return Task.CompletedTask;
}
private Task PartitionClosingAsync(PartitionClosingEventArgs args)
{
_logger.LogInformation(
"Partition {PartitionId} closing. Reason: {Reason}",
args.PartitionId,
args.Reason);
// Reason is either OwnershipLost or Shutdown
// Use this to flush buffers or release resources tied to a partition
return Task.CompletedTask;
}
Error Handling
private Task ProcessErrorAsync(ProcessErrorEventArgs args)
{
_logger.LogError(args.Exception,
"Error in partition {PartitionId}, operation: {Operation}",
args.PartitionId,
args.Operation);
// Common errors:
// - EventHubsException with Reason == ResourceNotFound: the event hub or consumer group doesn't exist
// - EventHubsException with Reason == ConsumerDisconnected: another reader took over the partition
// - OperationCanceledException: processor is shutting down (safe to ignore)
// The processor will automatically retry transient errors.
// Only unrecoverable errors need manual intervention.
return Task.CompletedTask;
}
Partition Balancing
How EventProcessorClient Distributes Partitions
The EventProcessorClient uses a cooperative balancing algorithm:
- Each processor instance periodically checks the ownership table (stored in Blob Storage).
- It counts total partitions and total active processors.
- It calculates the ideal distribution: partitions / processors (rounded up for some instances).
- If it owns fewer partitions than its fair share, it claims unclaimed partitions.
- If all partitions are claimed but distribution is uneven, it may steal a partition.
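The steps above can be sketched as a single decision function that one processor evaluates per balancing cycle. This is a simplified model under stated assumptions: the real balancer also handles expiry, renewal, and the choice of which partition to take. `BalancingSketch` and `BalancingAction` are hypothetical names.

```csharp
using System;

public enum BalancingAction { None, ClaimUnowned, Steal }

public static class BalancingSketch
{
    // Decide what one processor does in a single balancing cycle.
    public static BalancingAction Decide(
        int totalPartitions, int activeProcessors,
        int ownedByMe, int unowned, int maxOwnedByOther)
    {
        if (activeProcessors <= 0)
            throw new ArgumentOutOfRangeException(nameof(activeProcessors));

        // Fair share is partitions / processors, rounded up for some instances.
        int maxShare = (totalPartitions + activeProcessors - 1) / activeProcessors;

        // Unclaimed partitions are taken first, but only up to the fair share.
        if (unowned > 0 && ownedByMe < maxShare)
            return BalancingAction.ClaimUnowned;

        // Everything is claimed: steal only if that makes the spread more even.
        if (unowned == 0 && maxOwnedByOther - ownedByMe >= 2)
            return BalancingAction.Steal;

        return BalancingAction.None;
    }
}
```

A newly started instance facing one peer that owns all 4 partitions gets `Steal` on each cycle until both own 2, after which the function returns `None`.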
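The ownership table in Blob Storage looks roughly like this (a simplified view: the SDK nests these blobs under a namespace/event hub/consumer group prefix and stores the values as blob metadata):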
Container: eventhub-checkpoints
├── ownership/
│ ├── 0 → { "ownerId": "instance-A", "lastModified": "..." }
│ ├── 1 → { "ownerId": "instance-A", "lastModified": "..." }
│ ├── 2 → { "ownerId": "instance-B", "lastModified": "..." }
│ └── 3 → { "ownerId": "instance-C", "lastModified": "..." }
└── checkpoint/
├── 0 → { "offset": "1024", "sequenceNumber": 42 }
├── 1 → { "offset": "2048", "sequenceNumber": 87 }
├── 2 → { "offset": "512", "sequenceNumber": 21 }
└── 3 → { "offset": "768", "sequenceNumber": 35 }
Rebalancing: When Consumers Join or Leave
New consumer joins (scale-out):
Before (2 consumers, 4 partitions):
Instance A: P0, P1, P2, P3 ← overloaded
Instance B: (just started)
Balancing cycle 1:
Instance A: P0, P1, P2 ← loses P3 (OwnershipLost)
Instance B: P3 ← steals P3
Balancing cycle 2:
Instance A: P0, P1 ← loses P2 (OwnershipLost)
Instance B: P2, P3 ← steals P2
Final state (balanced):
Instance A: P0, P1
Instance B: P2, P3
Consumer leaves (scale-in or crash):
Before (3 consumers, 4 partitions):
Instance A: P0, P1
Instance B: P2 ← crashes!
Instance C: P3
After (B's ownership expires once PartitionOwnershipExpirationInterval elapses):
Instance A: P0, P1, P2 ← claims orphaned P2
Instance C: P3
Next balancing cycle:
Instance A: P0, P1 ← loses P2 (OwnershipLost)
Instance C: P2, P3 ← steals P2
Partition Ownership and Lease Management
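Ownership is recorded in Blob Storage and guarded by ETags for optimistic concurrency:
- Each processor renews its ownership claims on every load balancing cycle (LoadBalancingUpdateInterval).
- If a claim is not renewed within PartitionOwnershipExpirationInterval, its partitions become available to other processors.
- Ownership claims use ETag-based conditional writes, so when two processors race for the same partition only one write succeeds.
- Two processors should not own the same partition at once, although a brief overlap is possible while the previous owner learns that its claim was taken.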
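The conditional-write idea can be sketched with an in-memory store in which a version number plays the role of the ETag. This is an illustration of optimistic concurrency in general, not the Blob Storage API; `OwnershipStore` and its members are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public sealed class OwnershipStore
{
    private readonly object _gate = new();
    private readonly Dictionary<string, (string Owner, long Version)> _records = new();

    // Return the current owner and version for a partition, or null if unclaimed.
    public (string Owner, long Version)? Read(string partitionId)
    {
        lock (_gate)
        {
            if (_records.TryGetValue(partitionId, out var record)) return record;
            return null;
        }
    }

    // The write succeeds only if the caller saw the latest version.
    // expectedVersion == null means "create only if no record exists".
    public bool TryClaim(string partitionId, string newOwner, long? expectedVersion)
    {
        lock (_gate)
        {
            bool exists = _records.TryGetValue(partitionId, out var current);
            if (exists != expectedVersion.HasValue) return false;
            if (exists && current.Version != expectedVersion.Value) return false;

            // Every successful write produces a new version, invalidating older reads.
            long next = exists ? current.Version + 1 : 1;
            _records[partitionId] = (newOwner, next);
            return true;
        }
    }
}
```

If two processors both read version 1 and both try to claim, the first write moves the record to version 2 and the second fails because its expected version is stale.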
Handling Partition Steal Scenarios
When a partition is "stolen" (reassigned during rebalancing), your processor receives a PartitionClosingAsync event with Reason = OwnershipLost. Handle this gracefully:
private Task PartitionClosingAsync(PartitionClosingEventArgs args)
{
if (args.Reason == ProcessingStoppedReason.OwnershipLost)
{
_logger.LogWarning(
"Partition {PartitionId} was reassigned to another instance",
args.PartitionId);
// Flush any buffered data for this partition
// Cancel any in-progress work for this partition
// Do NOT checkpoint — the new owner will resume from last checkpoint
}
return Task.CompletedTask;
}
Load Balancing Strategies
The EventProcessorClient supports configuring the load balancing approach:
var options = new EventProcessorClientOptions
{
// How often the processor checks for rebalancing
LoadBalancingUpdateInterval = TimeSpan.FromSeconds(10),
// How long before an inactive ownership is considered expired
PartitionOwnershipExpirationInterval = TimeSpan.FromSeconds(30),
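// Balancing strategy. Defaults differ between SDK versions and languages,
// so set it explicitly rather than relying on the default.
LoadBalancingStrategy = LoadBalancingStrategy.Balanced
// Balanced: gradually converge to even distribution
// Greedy: claim all available partitions immediately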
};
Balanced strategy:
- Claims one unclaimed partition per cycle.
- Steals one partition per cycle if distribution is uneven.
- Slower to converge but less disruptive.
- Best for stable, long-running consumers.
Greedy strategy:
- Claims all unclaimed partitions immediately.
- Faster convergence after a crash or scale event.
- More disruptive — causes more partition handoffs.
- Best for scenarios where fast recovery matters more than stability.
| Scenario | Recommended Strategy |
|---|---|
| Stable production workload | Balanced |
| Frequent auto-scaling | Greedy |
| Crash recovery is critical | Greedy |
| Minimizing reprocessing | Balanced |
| Development/testing | Greedy |
Scaling Strategies
Scaling Consumers: The Partition Ceiling
The fundamental rule: maximum useful consumers in a consumer group = number of partitions.
4 partitions, 2 consumers → each consumer handles 2 partitions ✓
4 partitions, 4 consumers → each consumer handles 1 partition ✓ (max parallelism)
4 partitions, 6 consumers → 4 active + 2 idle (wasted) ✗
Plan your partition count based on your peak parallelism needs, not your current load.
When to Add Partitions
⚠️ Partitions can be added but never removed, and only on the Premium and Dedicated tiers. On Basic and Standard, the partition count is fixed when the event hub is created.
Add partitions when:
- Consumer lag is consistently growing despite having max consumers deployed.
- You need more aggregate throughput (each partition adds 1 MB/s in, 2 MB/s out).
- You're planning for future growth and want headroom for more consumers.
# Increase partition count (Premium and Dedicated tiers only; cannot decrease!)
az eventhubs eventhub update \
--resource-group myResourceGroup \
--namespace-name myNamespace \
--name telemetry-hub \
--partition-count 8
# Check current partition count
az eventhubs eventhub show \
--resource-group myResourceGroup \
--namespace-name myNamespace \
--name telemetry-hub \
--query partitionCount
Impact of adding partitions:
- New partitions start empty — no historical data is redistributed.
- Existing partition keys may now hash to new partitions (ordering disruption for new events).
- Consumers will automatically pick up new partitions during the next balancing cycle.
- No downtime required.
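The ordering disruption can be estimated before a resize by counting how many keys would land on a different partition. The sketch assumes the mapping is "hash modulo partition count" and takes the hash as a parameter, because it does not reproduce the service's own hash; `RepartitionImpact` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RepartitionImpact
{
    // Count keys whose partition changes when the partition count changes.
    // `hash` stands in for the service's key hash (an assumption of this sketch).
    public static int CountRemappedKeys(
        IEnumerable<string> keys, Func<string, uint> hash, int oldCount, int newCount)
    {
        if (oldCount <= 0) throw new ArgumentOutOfRangeException(nameof(oldCount));
        if (newCount <= 0) throw new ArgumentOutOfRangeException(nameof(newCount));

        return keys.Count(k => hash(k) % (uint)oldCount != hash(k) % (uint)newCount);
    }
}
```

Events already stored stay where they are. Only new events for a remapped key go to the new partition, so a consumer may briefly see that key's events on two partitions.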
Throughput Units and Partitions
| Resource | Per Throughput Unit (TU) | Per Partition (guideline) |
|---|---|---|
| Ingress | 1 MB/s or 1,000 events/s, whichever comes first | 1 MB/s |
| Egress | 2 MB/s or 4,096 events/s | 2 MB/s |
Relationship: You need enough TUs to cover your aggregate throughput AND enough partitions to distribute the load. Rule of thumb:
Minimum TUs needed = max(total ingress MB/s, total egress MB/s ÷ 2)
Minimum partitions = max(desired consumer parallelism, TUs needed)
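The rule of thumb translates directly into code. This sketch rounds fractional results up to whole throughput units and considers bandwidth only, not the per-TU event-count limits; `CapacityPlanner` is a hypothetical name.

```csharp
using System;

public static class CapacityPlanner
{
    // Minimum TUs = max(ingress MB/s, egress MB/s / 2), rounded up, at least 1.
    public static int MinimumThroughputUnits(double ingressMBps, double egressMBps)
    {
        if (ingressMBps < 0) throw new ArgumentOutOfRangeException(nameof(ingressMBps));
        if (egressMBps < 0) throw new ArgumentOutOfRangeException(nameof(egressMBps));

        double needed = Math.Max(ingressMBps, egressMBps / 2.0);
        return Math.Max(1, (int)Math.Ceiling(needed));
    }

    // Minimum partitions = max(desired consumer parallelism, TUs needed).
    public static int MinimumPartitions(int desiredParallelism, int throughputUnits) =>
        Math.Max(desiredParallelism, throughputUnits);
}
```

For 6 MB/s of ingress and 20 MB/s of egress, `MinimumThroughputUnits(6, 20)` returns 10. With a target of 8 parallel consumers, `MinimumPartitions(8, 10)` also returns 10.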
Auto-Inflate for Handling Spikes
Auto-inflate automatically scales throughput units up (but not down) when load increases:
# Enable auto-inflate with max 20 TUs
az eventhubs namespace update \
--resource-group myResourceGroup \
--name myNamespace \
--enable-auto-inflate true \
--maximum-throughput-units 20
- Scales up within seconds when throttling is detected.
- Does not scale down automatically — you must manually reduce TUs or use automation.
- Set a maximum to control costs.
- Consider Premium tier for truly elastic scaling without TU management.
Scaling Producers with Partition Keys
Choose partition keys carefully to distribute load evenly:
// Good: high-cardinality key distributes evenly
await producerClient.SendAsync(new[]
{
new EventData(payload)
}, new SendEventOptions { PartitionKey = deviceId });
// Bad: low-cardinality key creates hot partitions
await producerClient.SendAsync(new[]
{
new EventData(payload)
}, new SendEventOptions { PartitionKey = region }); // only 4 regions = uneven!
// Alternative: round-robin (no partition key) for max throughput
await producerClient.SendAsync(new[] { new EventData(payload) });
// Events distributed evenly but no ordering guarantee
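Key skew can be checked offline by replaying a sample of partition keys and counting events per partition. As a sketch, the hash is passed in rather than hard-coded, since the service's own hash is not reproduced here; `PartitionLoad` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartitionLoad
{
    // Count events per partition for a stream of partition keys.
    // `hash` is a stand-in for the service's key hash (an assumption).
    public static int[] CountPerPartition(
        IEnumerable<string> keys, Func<string, uint> hash, int partitionCount)
    {
        if (partitionCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(partitionCount));

        var counts = new int[partitionCount];
        foreach (var key in keys)
        {
            counts[hash(key) % (uint)partitionCount]++;
        }
        return counts;
    }

    // Partitions that received no events at all.
    public static int EmptyPartitions(int[] counts) => counts.Count(c => c == 0);
}
```

A key with only four distinct values can reach at most four partitions, so on a 32-partition hub at least 28 stay empty regardless of the hash.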
Checkpointing Strategies
Checkpoint Frequency Trade-offs
Checkpointing writes the current offset to Blob Storage. The frequency creates a trade-off:
| Strategy | Pros | Cons |
|---|---|---|
| Every event | Minimal reprocessing on crash | High storage I/O, slower |
| Every N events (e.g., 10) | Balanced | Up to N events reprocessed |
| Every N seconds | Predictable I/O cost | Variable reprocessing window |
| On batch completion | Atomic batch semantics | Entire batch reprocessed on fail |
Recommended approach — checkpoint every N events OR every N seconds, whichever comes first:
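// Requires: using System.Collections.Concurrent;
// State is kept per partition because handlers for different partitions run
// concurrently; a single shared counter would race and starve some partitions.
private readonly ConcurrentDictionary<string, (int Count, DateTime Last)> _checkpointState = new();

private async Task ProcessEventAsync(ProcessEventArgs args)
{
    await ProcessBusinessLogicAsync(args.Data);

    var partitionId = args.Partition.PartitionId;
    var state = _checkpointState.GetOrAdd(partitionId, _ => (0, DateTime.UtcNow));
    var count = state.Count + 1;

    bool shouldCheckpoint =
        count >= 100 ||
        (DateTime.UtcNow - state.Last) > TimeSpan.FromSeconds(30);

    if (shouldCheckpoint)
    {
        await args.UpdateCheckpointAsync();
        _checkpointState[partitionId] = (0, DateTime.UtcNow);
    }
    else
    {
        _checkpointState[partitionId] = (count, state.Last);
    }
}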
Exactly-Once vs. At-Least-Once
Event Hubs provides at-least-once delivery by default. After a crash, events between the last checkpoint and the crash point will be redelivered. The service does not offer exactly-once delivery, so to get exactly-once effects your processing must be idempotent.
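The size of the redelivery window follows from the checkpoint interval. A minimal sketch, assuming a checkpoint is written after every N-th processed event on a partition and that the crash happens between events; `ReplayWindow` is a hypothetical name.

```csharp
using System;

public static class ReplayWindow
{
    // Events redelivered after a crash when a checkpoint is written after
    // every `checkpointEvery`-th processed event.
    public static long RedeliveredAfterCrash(long processedBeforeCrash, long checkpointEvery)
    {
        if (processedBeforeCrash < 0)
            throw new ArgumentOutOfRangeException(nameof(processedBeforeCrash));
        if (checkpointEvery <= 0)
            throw new ArgumentOutOfRangeException(nameof(checkpointEvery));

        // Everything since the last multiple of the interval has no checkpoint.
        return processedBeforeCrash % checkpointEvery;
    }
}
```

With a checkpoint every 100 events, a crash after 99 processed events redelivers all 99, while a crash after 250 redelivers the last 50.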
Idempotent processing pattern:
private async Task ProcessEventAsync(ProcessEventArgs args)
{
var eventId = args.Data.MessageId
?? $"{args.Partition.PartitionId}-{args.Data.SequenceNumber}";
// Check if already processed (using a deduplication store)
if (await _deduplicationStore.HasBeenProcessedAsync(eventId))
{
_logger.LogDebug("Skipping duplicate event {EventId}", eventId);
return;
}
// Process the event
await ProcessBusinessLogicAsync(args.Data);
// Mark as processed (ideally in the same transaction as your business logic)
await _deduplicationStore.MarkProcessedAsync(eventId);
await args.UpdateCheckpointAsync();
}
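The `_deduplicationStore` used above is not defined in this article. For local testing it could be backed by something like the in-memory sketch below. The class name is hypothetical, and because its state is lost on restart it does not protect against redelivery after a crash; a durable store would be needed for that.

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical in-memory stand-in for a deduplication store (testing only).
public sealed class InMemoryDeduplicationStore
{
    // The value is unused; the dictionary serves as a thread-safe set of event IDs.
    private readonly ConcurrentDictionary<string, byte> _seen = new();

    public Task<bool> HasBeenProcessedAsync(string eventId) =>
        Task.FromResult(_seen.ContainsKey(eventId));

    public Task MarkProcessedAsync(string eventId)
    {
        // TryAdd is a no-op if the ID is already recorded.
        _seen.TryAdd(eventId, 0);
        return Task.CompletedTask;
    }
}
```

The check and the mark are separate calls, so two handlers processing the same event ID at the same moment could both pass the check. Within one partition the processor delivers events sequentially, which keeps that window small in practice.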
Recording the dedup key in the same database transaction as the write:
// Process event and record dedup key in a single database transaction
using var transaction = await _db.BeginTransactionAsync();
await _db.InsertTelemetryAsync(telemetry, transaction);
await _db.InsertProcessedEventAsync(eventId, transaction);
await transaction.CommitAsync();
await args.UpdateCheckpointAsync();
Real-World Scenarios
Scenario 1: IoT Telemetry Platform
Devices (100K+) → Event Hub (32 partitions)
├── Consumer Group: "realtime-alerts" (8 instances)
│ └── Checks thresholds, triggers alerts within seconds
├── Consumer Group: "analytics" (4 instances)
│ └── Aggregates to Azure Data Explorer every 5 minutes
├── Consumer Group: "cold-storage" (2 instances)
│ └── Archives raw events to ADLS Gen2
└── Consumer Group: "device-state" (8 instances)
└── Maintains latest-known-state per device in Redis
- Partition key: Device ID (ensures all events from one device go to same partition).
- 32 partitions: Supports up to 32 parallel consumers per group.
- Alerts group has 8 instances for low latency; checkpoints every event.
- Analytics group has 4 instances; checkpoints every 5 minutes (batch-oriented).
- Cold storage only needs 2 instances — archival is I/O-bound, not CPU-bound.
Scenario 2: Click-Stream Processing
Web/Mobile Apps → Event Hub (16 partitions)
├── Consumer Group: "realtime-dashboard" (4 instances)
│ └── Updates live user count, trending pages
├── Consumer Group: "personalization" (8 instances)
│ └── Feeds ML model for real-time recommendations
└── Consumer Group: "batch-etl" (2 instances)
└── Loads to data warehouse every hour
- Partition key: Session ID (keeps a user's session events together).
- Personalization needs high parallelism for ML inference latency.
- Batch ETL runs fewer instances since it processes in large batches.
Scenario 3: Log Aggregation
Microservices (50+) → Event Hub (8 partitions)
├── Consumer Group: "search-index" (4 instances)
│ └── Indexes to Elasticsearch for search
├── Consumer Group: "anomaly-detection" (4 instances)
│ └── ML-based anomaly detection pipeline
└── Consumer Group: "compliance" (2 instances)
└── Filters and archives PII-containing logs
- Partition key: Service name (keeps logs from one service ordered).
- Compliance group processes slowly (PII detection is expensive) but doesn't block others.
Consumer Lag Monitoring
Consumer lag is the difference between the latest event in a partition and the consumer's current checkpoint. Growing lag means your consumer can't keep up.
Detecting Lag Programmatically
public async Task<Dictionary<string, long>> GetConsumerLagAsync(
string connectionString,
string eventHubName,
string consumerGroup,
BlobContainerClient checkpointContainer)
{
var lag = new Dictionary<string, long>();
await using var consumer = new EventHubConsumerClient(
consumerGroup, connectionString, eventHubName);
var partitionIds = await consumer.GetPartitionIdsAsync();
foreach (var partitionId in partitionIds)
{
var properties = await consumer.GetPartitionPropertiesAsync(partitionId);
var lastEnqueued = properties.LastEnqueuedSequenceNumber;
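// Read checkpoint from blob storage. The path and metadata key below follow the
// layout the .NET SDK's blob checkpoint store is understood to use (all lowercase).
// This is an internal detail, so verify it against your own container.
var checkpointBlob = checkpointContainer.GetBlobClient(
($"{consumer.FullyQualifiedNamespace}/{eventHubName}/{consumerGroup}" +
$"/checkpoint/{partitionId}").ToLowerInvariant());
long consumerSequence = 0;
if (await checkpointBlob.ExistsAsync())
{
var props = await checkpointBlob.GetPropertiesAsync();
if (props.Value.Metadata.TryGetValue("sequencenumber", out var seq))
consumerSequence = long.Parse(seq);
}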
lag[partitionId] = lastEnqueued - consumerSequence;
}
return lag;
}
Lag Alert Thresholds
| Lag Level | Events Behind | Action |
|---|---|---|
| Healthy | < 100 | No action needed |
| Warning | 100–1,000 | Monitor — may need more consumers |
| Critical | 1,000–10,000 | Scale out consumers immediately |
| Emergency | > 10,000 | Risk of data loss (retention expiry) |
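The thresholds above can be encoded as a small classifier for use in alerting code. The table's ranges share their boundary values, so this sketch assumes exactly 1,000 counts as Warning and exactly 10,000 as Critical; `LagLevel` and `LagThresholds` are hypothetical names.

```csharp
public enum LagLevel { Healthy, Warning, Critical, Emergency }

public static class LagThresholds
{
    // Map a per-partition lag (events behind) to an alert level.
    public static LagLevel Classify(long eventsBehind)
    {
        if (eventsBehind < 100) return LagLevel.Healthy;
        if (eventsBehind <= 1_000) return LagLevel.Warning;
        if (eventsBehind <= 10_000) return LagLevel.Critical;
        return LagLevel.Emergency;
    }
}
```

Classifying the largest per-partition lag, rather than the sum, avoids hiding one badly lagging partition behind several healthy ones.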
Best Practices
| Area | Recommendation |
|---|---|
| Partition count | Start with 4–8; scale to 32 for high-throughput workloads |
| Consumer groups | One per logical subscriber; never share $Default in production |
| Partition keys | Use high-cardinality keys (device ID, user ID) for even spread |
| Checkpointing | Every 100 events or 30 seconds, whichever comes first |
| Error handling | Never let exceptions escape ProcessEventAsync |
| Idempotency | Design all consumers to handle duplicate events |
| Scaling | Set partition count to your peak parallelism need |
| Monitoring | Alert on consumer lag > 1000 events |
| Blob storage | Use a dedicated container per consumer group for checkpoints |
| Graceful shutdown | Call StopProcessingAsync to release partitions cleanly; it does not checkpoint for you |
Common Pitfalls
1. More consumers than partitions: Adding a 5th consumer to a 4-partition hub wastes resources. The extra consumer sits idle.
2. Checkpointing inside a try/catch that swallows errors: If you checkpoint after a failed processing attempt, you'll skip events permanently.
3. Using $Default for multiple applications: Two apps sharing $Default will fight over partition ownership. Create separate consumer groups.
4. Low-cardinality partition keys: Using "region" (4 values) as a partition key with 32 partitions means at least 28 partitions stay empty.
5. Not handling OwnershipLost: When a partition is reassigned, in-flight work should be cancelled, not completed and checkpointed.
6. Synchronous processing blocking the event loop: The EventProcessorClient uses async handlers. Blocking calls (.Result, .Wait()) can deadlock.
7. Forgetting that partitions can't be removed: Adding partitions is irreversible. Start conservative and scale up based on measured need.
8. Not accounting for reprocessing after crashes: If you checkpoint every 100 events and crash at event 99, all 99 events will be redelivered. Design your processing to be idempotent.
Monitoring with KQL
Approximate Consumer Lag
// Query Azure Monitor / Event Hubs metrics exported to Log Analytics.
// This compares incoming and outgoing message counts per resource. It is only a rough signal:
// OutgoingMessages counts reads by every consumer group, and AzureMetrics carries no partition dimension.
AzureMetrics
| where ResourceProvider == "MICROSOFT.EVENTHUB"
| where MetricName == "IncomingMessages"
| summarize TotalIncoming = sum(Total) by bin(TimeGenerated, 5m), Resource
| join kind=leftouter (
AzureMetrics
| where ResourceProvider == "MICROSOFT.EVENTHUB"
| where MetricName == "OutgoingMessages"
| summarize TotalOutgoing = sum(Total) by bin(TimeGenerated, 5m), Resource
) on TimeGenerated, Resource
| extend Lag = TotalIncoming - TotalOutgoing
| project TimeGenerated, Resource, TotalIncoming, TotalOutgoing, Lag
| order by TimeGenerated desc
Throughput Monitoring
// Incoming vs outgoing bytes over time
AzureMetrics
| where ResourceProvider == "MICROSOFT.EVENTHUB"
| where MetricName in ("IncomingBytes", "OutgoingBytes")
| summarize AvgBytes = avg(Average) by bin(TimeGenerated, 5m), MetricName
| render timechart
Throttling Detection
// Detect throttled requests (exceeded TU limits)
AzureMetrics
| where ResourceProvider == "MICROSOFT.EVENTHUB"
| where MetricName == "ThrottledRequests"
| where Total > 0
| summarize ThrottleCount = sum(Total) by bin(TimeGenerated, 5m), Resource
| order by TimeGenerated desc
Partition Distribution Health
// Custom log from your application (if you emit partition ownership events)
CustomLogs_CL
| where Category == "EventHubProcessor"
| where Message contains "partition"
| summarize PartitionsOwned = dcount(PartitionId) by InstanceId, bin(TimeGenerated, 1m)
| render columnchart
Summary
Consumer groups and partitions are the two levers that control how you scale Event Hubs consumption:
- Partitions set the ceiling for parallelism and throughput.
- Consumer groups let multiple independent applications read the same stream.
- Checkpointing controls the trade-off between reprocessing risk and storage I/O.
- Partition balancing happens automatically but understanding it helps you debug issues.
Start with a clear mental model: Event Hubs is a partitioned, append-only log. Consumer groups are independent cursors into that log. Everything else — balancing, checkpointing, scaling — follows from these two concepts.