Azure API Management — APIM as AI Gateway (Azure OpenAI)

Token Rate Limiting, Regional Load Balancing, Usage Logging


Overview

Use APIM as an AI gateway to manage Azure OpenAI requests with rate limiting, token tracking, and multi-region load balancing.


Token Rate Limiting

<policies>
    <inbound>
        <base />
        
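        <!-- Estimate the prompt token count from the request body -->
        <set-variable name="prompt-tokens" 
            value="@{
                // preserveContent keeps the body readable when the request is forwarded
                var body = context.Request.Body.As<JObject>(preserveContent: true);
                var messages = body["messages"] as JArray;
                var totalTokens = 0;
                if (messages != null) {
                    foreach (var msg in messages) {
                        // Rough heuristic: about 4 characters per token (assumes string content)
                        totalTokens += ((string)msg["content"] ?? string.Empty).Length / 4;
                    }
                }
                return totalTokens;
            }" />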
        
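        <!-- Rate limit by estimated token count: each request adds its estimate to the
             counter, so calls="100" acts as a budget of 100 estimated tokens per 60 seconds -->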
        <rate-limit-by-key calls="100" 
            renewal-period="60" 
            increment-count="@(context.Variables.GetValue<int>("prompt-tokens"))"
            counter-key="@(context.Subscription.Id)" />

        <!-- Route to the Azure OpenAI resource (transformation policies run in inbound) -->
        <set-backend-service base-url="https://my-resource.openai.azure.com" />
        
        <set-header name="api-key" exists-action="override">
            <value>{{OpenAI-Key}}</value>
        </set-header>
        
        <!-- Rewrite to chat/completions -->
        <rewrite-uri template="/openai/deployments/{deployment-name}/chat/completions?api-version=2024-02-01" />
    </inbound>
    
    <backend>
        <base />
    </backend>
    
    <outbound>
        <base />
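        <!-- Log token usage; logger-id must name an existing Event Hub logger (placeholder name below) -->
        <log-to-eventhub logger-id="openai-usage-logger">
            @{
                // preserveContent keeps the response body available to the caller
                var response = context.Response.Body.As<JObject>(preserveContent: true);
                // log-to-eventhub expects a string, so serialize the record as JSON
                return new JObject(
                    new JProperty("timestamp", DateTime.UtcNow),
                    new JProperty("subscriptionId", context.Subscription.Id),
                    new JProperty("promptTokens", context.Variables.GetValue<int>("prompt-tokens")),
                    new JProperty("completionTokens", (int?)response?["usage"]?["completion_tokens"] ?? 0),
                    new JProperty("totalTokens", (int?)response?["usage"]?["total_tokens"] ?? 0)
                ).ToString();
            }
        </log-to-eventhub>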
    </outbound>
</policies>
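
To make the arithmetic in the policy above concrete, here is a minimal Python sketch of its two ideas: the characters-divided-by-four token estimate, and a fixed-window counter that grows by that estimate instead of by one per call. It illustrates the logic only and is not how APIM implements rate-limit-by-key; the names estimate_prompt_tokens and FixedWindowLimiter are hypothetical.

```python
import time


def estimate_prompt_tokens(messages):
    """Approximate tokens as len(content) // 4 per message, like the policy expression."""
    total = 0
    for msg in messages:
        content = msg.get("content") or ""
        total += len(content) // 4
    return total


class FixedWindowLimiter:
    """Hypothetical fixed-window counter keyed by subscription ID."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self._windows = {}  # key -> (window_start, used)

    def allow(self, key, cost):
        """Return True and charge `cost` if it fits in the current window."""
        now = self.clock()
        start, used = self._windows.get(key, (now, 0))
        if now - start >= self.window_seconds:
            start, used = now, 0  # window expired: reset the counter
        if used + cost > self.limit:
            self._windows[key] = (start, used)
            return False
        self._windows[key] = (start, used + cost)
        return True
```

With the policy's values (a limit of 100 per 60 seconds), a single 200-character message is estimated at 50 tokens, so two such requests fit in one window and a third is rejected until the window resets.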

Regional Load Balancing

<inbound>
    <base />
    <!-- set-backend-service runs in the inbound section -->
    <choose>
        <when condition="@(DateTime.UtcNow.Hour < 12)">
            <!-- East US from 00:00 to 11:59 UTC -->
            <set-backend-service base-url="https://eastus.openai.azure.com" />
        </when>
        <when condition="@(DateTime.UtcNow.Hour < 18)">
            <!-- West Europe from 12:00 to 17:59 UTC -->
            <set-backend-service base-url="https://westeurope.openai.azure.com" />
        </when>
        <otherwise>
            <!-- East US from 18:00 to 23:59 UTC -->
            <set-backend-service base-url="https://eastus.openai.azure.com" />
        </otherwise>
    </choose>
</inbound>
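
The branch logic above can be spelled out as a small Python sketch to show which backend each UTC hour maps to. The function name pick_backend is hypothetical; the URLs are the ones used in the policy.

```python
EAST_US = "https://eastus.openai.azure.com"
WEST_EUROPE = "https://westeurope.openai.azure.com"


def pick_backend(utc_hour):
    """Mirror the <choose> branches: the first matching condition wins."""
    if utc_hour < 12:
        return EAST_US
    if utc_hour < 18:
        return WEST_EUROPE
    return EAST_US  # the <otherwise> branch


# Full 24-hour schedule, keyed by UTC hour.
schedule = {hour: pick_backend(hour) for hour in range(24)}
```

Because the choice depends only on the clock, West Europe serves 6 of the 24 hours and East US serves the rest, regardless of load or backend health.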

Quota by Subscription

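<inbound>
    <base />
    <!-- Per subscription, per 24 hours (86400 s): 10,000 calls and 100,000,000 KB of bandwidth -->
    <quota-by-key calls="10000" 
        bandwidth="100000000" 
        renewal-period="86400"
        counter-key="@(context.Subscription.Id)" />
</inbound>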

Token Usage Tracking

// Query token usage in Log Analytics
AzureDiagnostics
| where TimeGenerated >= ago(1h)
| where OperationName == "Inbound"
| where ApiId contains "openai"
| extend promptTokens = toint(parse_json(Details)["promptTokens"])
| summarize totalTokens=sum(promptTokens) by SubscriptionId, bin(TimeGenerated, 1h)
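
The summarize step above groups prompt tokens by subscription and hour. The following Python sketch performs the same grouping over records shaped like those emitted by the logging policy (timestamp, subscriptionId, promptTokens). It assumes ISO 8601 timestamps without a time zone suffix and skips the query's filters; summarize_tokens is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime


def summarize_tokens(events):
    """Sum promptTokens per (subscriptionId, hour bucket), like the summarize step."""
    totals = defaultdict(int)
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Truncate to the start of the hour, the equivalent of bin(TimeGenerated, 1h).
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        key = (event["subscriptionId"], bucket.isoformat())
        totals[key] += int(event["promptTokens"])
    return dict(totals)
```

Each key in the result is a (subscription, hour) pair, so per-subscription cost reports can be built by multiplying the totals by the model's token price.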

Cache Responses

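<inbound>
    <base />
    <!-- cache-lookup runs in inbound; cache-store runs in outbound -->
    <!-- Note: the built-in cache serves GET requests only, and chat messages travel
         in the POST body rather than the query string, so this applies to GET operations -->
    <cache-lookup vary-by-developer="false" vary-by-developer-groups="false">
        <vary-by-query-parameter>messages</vary-by-query-parameter>
    </cache-lookup>
</inbound>

<outbound>
    <base />
    <!-- Keep cached responses for one hour (3600 s) -->
    <cache-store duration="3600" />
</outbound>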
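
For intuition about what a response cache has to vary on, here is a minimal Python sketch of an exact-match cache keyed on the request body, with the same one-hour lifetime as the duration above. It is an illustration under stated assumptions, not APIM's caching implementation; cache_key and TtlCache are hypothetical names.

```python
import hashlib
import json
import time


def cache_key(body):
    """Hash a canonical JSON form of the request body (key order does not matter)."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TtlCache:
    """Hypothetical in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at >= self.ttl_seconds:
            del self._entries[key]  # expired: drop the entry and report a miss
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (self.clock(), value)
```

An exact-match key only produces a hit when the messages and parameters are identical, so prompts that differ by a single character are cached separately.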

Best Practices

Practice                  Benefit
Token-based rate limit    Control actual API usage
Regional routing          Reduce latency
Usage logging             Track costs per subscription
Caching                   Reduce API calls

Azure Integration Hub - Advanced Level