Azure API Management — APIM as AI Gateway (Azure OpenAI)

Token Rate Limiting, Regional Load Balancing, Usage Logging

Overview

Use APIM as an AI gateway to manage Azure OpenAI requests with rate limiting, token tracking, and multi-region load balancing.

Token Rate Limiting

<policies>
    <inbound>
        <base />
        
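        <!-- Estimate prompt tokens from the request body (about 4 characters per token; a heuristic, not a tokenizer) -->
        <set-variable name="prompt-tokens" 
            value="@{
                var body = context.Request.Body.As<JObject>(preserveContent: true);
                var messages = body?["messages"] as JArray;
                var totalChars = 0;
                if (messages != null) {
                    foreach (var msg in messages) {
                        // content may be missing, null, or an array of parts; count only plain strings
                        var content = msg["content"];
                        if (content != null && content.Type == JTokenType.String) {
                            totalChars += ((string)content).Length;
                        }
                    }
                }
                // Divide once so short messages are not each rounded down to 0; charge at least 1
                return Math.Max(1, totalChars / 4).ToString();
            }" />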
        
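        <!-- Rate limit by estimated tokens: each request adds its token estimate to the counter,
             so "calls" acts as a token budget per renewal period (60 s); 10000 is an illustrative value -->
        <rate-limit-by-key calls="10000" 
            renewal-period="60" 
            increment-count="@(int.Parse((string)context.Variables["prompt-tokens"]))"
            counter-key="@(context.Subscription.Id)" />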
        
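        <!-- Route to the Azure OpenAI resource (set-backend-service is valid in inbound) -->
        <set-backend-service base-url="https://my-resource.openai.azure.com" />

        <!-- Authenticate to the backend with a key stored as a named value -->
        <set-header name="api-key" exists-action="override">
            <value>{{OpenAI-Key}}</value>
        </set-header>
        <!-- rewrite-uri is only allowed in inbound; {deployment-name} must be a parameter of the operation's URL template -->
        <rewrite-uri template="/openai/deployments/{deployment-name}/chat/completions?api-version=2024-02-01" />
    </inbound>
    
    <backend>
        <!-- Keep the inherited forward-request; without it the request is never sent to the backend -->
        <base />
    </backend>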
    
    <outbound>
        <base />
        <!-- Log token usage; assumes a non-streamed JSON response. promptTokens is the inbound estimate,
             completionTokens and totalTokens come from the backend's "usage" object (0 if absent) -->
        <log-to-eventhub logger-id="{{eventhub-logger-id}}">
            @{
                var response = context.Response.Body?.As<JObject>(preserveContent: true);
                var usage = response?["usage"];
                return new JObject(
                    new JProperty("timestamp", DateTime.UtcNow),
                    new JProperty("subscriptionId", context.Subscription.Id),
                    new JProperty("promptTokens", int.Parse((string)context.Variables["prompt-tokens"])),
                    new JProperty("completionTokens", usage?["completion_tokens"] ?? 0),
                    new JProperty("totalTokens", usage?["total_tokens"] ?? 0)
                ).ToString();
            }
        </log-to-eventhub>
    </outbound>
</policies>
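
The character-count estimate above is only an approximation. API Management also provides a built-in azure-openai-token-limit policy that counts tokens itself. The fragment below is a minimal sketch of that alternative, assuming the API fronts Azure OpenAI chat completions; the 5000 tokens-per-minute budget and the remainingTokens variable name are illustrative values, not taken from the configuration above.

```xml
<policies>
    <inbound>
        <base />
        <!-- Per-subscription token budget. With estimate-prompt-tokens="false" the gateway
             relies on the token usage reported in the backend response instead of estimating
             the prompt size up front. -->
        <azure-openai-token-limit
            counter-key="@(context.Subscription.Id)"
            tokens-per-minute="5000"
            estimate-prompt-tokens="false"
            remaining-tokens-variable-name="remainingTokens" />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>
```

If this policy is used, the manual set-variable estimate and the rate-limit-by-key element would normally be dropped so that tokens are not counted twice.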

Regional Load Balancing

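<inbound>
    <base />
    <!-- Time-of-day routing (UTC). This shifts traffic by schedule; it does not balance by load.
         set-backend-service sits in inbound so the default forward-request in backend still runs. -->
    <choose>
        <when condition="@(DateTime.UtcNow.Hour < 12)">
            <!-- 00:00 to 11:59 UTC: East US -->
            <set-backend-service base-url="https://eastus.openai.azure.com" />
        </when>
        <when condition="@(DateTime.UtcNow.Hour < 18)">
            <!-- 12:00 to 17:59 UTC: West Europe -->
            <set-backend-service base-url="https://westeurope.openai.azure.com" />
        </when>
        <otherwise>
            <!-- 18:00 to 23:59 UTC: East US -->
            <set-backend-service base-url="https://eastus.openai.azure.com" />
        </otherwise>
    </choose>
</inbound>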
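
Routing by clock does not react to throttling. A common complement is to fail over to the other region when the primary answers 429. The fragment below is a sketch of that pattern using the retry policy; the two hostnames are the placeholders from this section, and it assumes both regional resources accept the same api-key header value (in practice each resource usually has its own key, which would need to be set alongside the backend switch).

```xml
<inbound>
    <base />
    <!-- Primary region -->
    <set-backend-service base-url="https://eastus.openai.azure.com" />
</inbound>

<backend>
    <!-- Retry once if the backend reports throttling (HTTP 429) -->
    <retry condition="@(context.Response != null && context.Response.StatusCode == 429)"
           count="1" interval="1" first-fast-retry="true">
        <choose>
            <!-- On the retry pass a 429 response is already present, so switch regions -->
            <when condition="@(context.Response != null && context.Response.StatusCode == 429)">
                <set-backend-service base-url="https://westeurope.openai.azure.com" />
            </when>
        </choose>
        <!-- Buffer the body so it can be sent again on the retry -->
        <forward-request buffer-request-body="true" />
    </retry>
</backend>
```

The first pass forwards to East US unchanged; only a throttled response triggers the second attempt against West Europe.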

Quota by Subscription

<inbound>
    <base />
    <!-- Per subscription, per day (86400 s): 10,000 calls and 100,000,000 KB of traffic -->
    <quota-by-key calls="10000" 
        bandwidth="100000000" 
        renewal-period="86400"
        counter-key="@(context.Subscription.Id)" />
</inbound>

Token Usage Tracking

// Prompt tokens per subscription per hour, from APIM logs in Log Analytics
// (table and column names depend on how diagnostics are exported)
AzureDiagnostics
| where TimeGenerated >= ago(1h)
| where OperationName == "Inbound"
| where ApiId contains "openai"
| extend promptTokens = toint(parse_json(Details)["promptTokens"])
| summarize totalPromptTokens = sum(promptTokens) by SubscriptionId, bin(TimeGenerated, 1h)
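
Token counts can also be emitted as metrics by the gateway itself rather than estimated in a policy expression. The fragment below sketches the built-in azure-openai-emit-token-metric policy, assuming Application Insights is connected to the API Management instance; the namespace value and the choice of dimensions are illustrative.

```xml
<inbound>
    <base />
    <!-- Emit prompt, completion, and total token metrics, split by the listed dimensions -->
    <azure-openai-emit-token-metric namespace="openai">
        <dimension name="Subscription ID" />
        <dimension name="API ID" />
        <!-- Custom dimension computed from the request -->
        <dimension name="Client IP" value="@(context.Request.IpAddress)" />
    </azure-openai-emit-token-metric>
</inbound>
```

The resulting metrics can then be charted or queried per subscription without parsing log payloads.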

Cache Responses

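<inbound>
    <base />
    <!-- Built-in response caching only applies to GET requests, and "messages" travels in the POST body
         rather than the query string, so this pair does not cache chat completions as written.
         For POST, key cache-lookup-value / cache-store-value on a hash of the body instead. -->
    <cache-lookup vary-by-developer="false" vary-by-developer-groups="false">
        <vary-by-query-parameter>api-version</vary-by-query-parameter>
    </cache-lookup>
</inbound>

<outbound>
    <base />
    <cache-store duration="3600" />
</outbound>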
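
Chat completions are POST requests whose prompt sits in the body, so exact-match caching needs a key derived from that body. The fragment below is a sketch using cache-lookup-value and cache-store-value; the variable names cache-key and cached-body and the aoai- key prefix are made up for illustration, and it assumes SHA256 is available to policy expressions in your service tier.

```xml
<inbound>
    <base />
    <!-- Cache key: SHA-256 of the raw request body -->
    <set-variable name="cache-key" value="@{
        var body = context.Request.Body?.As<string>(preserveContent: true) ?? "";
        var sha = System.Security.Cryptography.SHA256.Create();
        var hash = sha.ComputeHash(System.Text.Encoding.UTF8.GetBytes(body));
        return "aoai-" + Convert.ToBase64String(hash);
    }" />
    <cache-lookup-value key="@((string)context.Variables["cache-key"])" variable-name="cached-body" />
    <choose>
        <!-- Cache hit: answer directly and skip the backend call -->
        <when condition="@(context.Variables.ContainsKey("cached-body"))">
            <return-response>
                <set-status code="200" reason="OK" />
                <set-header name="Content-Type" exists-action="override">
                    <value>application/json</value>
                </set-header>
                <set-body>@((string)context.Variables["cached-body"])</set-body>
            </return-response>
        </when>
    </choose>
</inbound>

<outbound>
    <base />
    <choose>
        <!-- Store only successful responses, for one hour -->
        <when condition="@(context.Response.StatusCode == 200)">
            <cache-store-value key="@((string)context.Variables["cache-key"])"
                value="@(context.Response.Body.As<string>(preserveContent: true))"
                duration="3600" />
        </when>
    </choose>
</outbound>
```

This only helps when clients send byte-identical bodies, and it replays the same completion even for requests that sample with a non-zero temperature, so it suits deterministic or FAQ-style prompts best.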

Best Practices

Practice	Benefit
Token-based rate limit	Control actual API usage
Regional routing	Reduce latency
Usage logging	Track costs per subscription
Caching	Reduce API calls
