Azure API Management — APIM as AI Gateway (Azure OpenAI)
Token Rate Limiting, Regional Load Balancing, Usage Logging
Overview
Use APIM as an AI gateway to manage Azure OpenAI requests with rate limiting, token tracking, and multi-region load balancing.
Token Rate Limiting
<policies>
<inbound>
<base />
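<!-- Estimate prompt tokens from the request body (roughly 4 characters per token) -->
<set-variable name="prompt-tokens"
value="@{
    // preserveContent keeps the body readable for the backend call
    var body = context.Request.Body.As<JObject>(preserveContent: true);
    var messages = body["messages"] as JArray;
    var totalTokens = 0;
    if (messages != null) {
        foreach (var msg in messages) {
            // content may be null or a non-string value (for example, multi-part content)
            var content = msg["content"];
            if (content != null && content.Type == JTokenType.String) {
                totalTokens += ((string)content).Length / 4; // approximation only
            }
        }
    }
    return totalTokens;
}" />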
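<!-- Rate limit by estimated tokens: increment-count adds the token estimate instead of 1, so calls="100" acts as a budget of 100 estimated tokens per 60 seconds per subscription; size it to your real token budget -->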
<rate-limit-by-key calls="100"
renewal-period="60"
increment-count="@(context.Variables.GetValue<int>("prompt-tokens"))"
counter-key="@(context.Subscription.Id)" />
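<!-- Rewrite to chat/completions. rewrite-uri is only valid in the inbound section, and {deployment-name} must be a template parameter of the operation URL -->
<rewrite-uri template="/openai/deployments/{deployment-name}/chat/completions?api-version=2024-02-01" />
</inbound>
<backend>
<set-backend-service base-url="https://my-resource.openai.azure.com" />
<!-- {{OpenAI-Key}} is an APIM named value holding the Azure OpenAI key -->
<set-header name="api-key" exists-action="override">
<value>{{OpenAI-Key}}</value>
</set-header>
<!-- base invokes forward-request; without it the request is never sent to the backend -->
<base />
</backend>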
<outbound>
<base />
<!-- Log token usage -->
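<!-- logger-id must name an Event Hubs logger registered in APIM; "openai-usage-logger" is a placeholder -->
<log-to-eventhub logger-id="openai-usage-logger">
@{
    // preserveContent keeps the response body available for the client
    var response = context.Response.Body.As<JObject>(preserveContent: true);
    var usage = response?["usage"];
    // The policy must return a string, so serialize the record as JSON
    return new JObject(
        new JProperty("timestamp", DateTime.UtcNow),
        new JProperty("subscriptionId", context.Subscription.Id),
        new JProperty("promptTokens", context.Variables.GetValue<int>("prompt-tokens")),
        new JProperty("completionTokens", (int?)usage?["completion_tokens"] ?? 0),
        new JProperty("totalTokens", (int?)usage?["total_tokens"] ?? 0)
    ).ToString();
}
</log-to-eventhub>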
</outbound>
</policies>
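To see how the token estimate and the counter interact, the following is a minimal Python sketch of the same logic outside APIM. It assumes the 4-characters-per-token approximation used in the policy and a simple fixed renewal window; `estimate_prompt_tokens` and `TokenWindow` are hypothetical names for illustration, and the exact point at which APIM itself rejects a request may differ from this simplified model.

```python
import time


def estimate_prompt_tokens(messages):
    """Approximate tokens as characters // 4, like the policy expression."""
    total = 0
    for msg in messages:
        content = msg.get("content")
        # Skip missing, null, or non-string content instead of failing.
        if isinstance(content, str):
            total += len(content) // 4
    return total


class TokenWindow:
    """Fixed-window budget per key (simplified model of rate-limit-by-key)."""

    def __init__(self, limit, period_seconds, clock=time.monotonic):
        self.limit = limit
        self.period = period_seconds
        self.clock = clock
        self.windows = {}  # key -> (window_start, tokens_used)

    def allow(self, key, cost):
        now = self.clock()
        start, used = self.windows.get(key, (now, 0))
        if now - start >= self.period:
            # Renewal period elapsed: start a fresh window.
            start, used = now, 0
        if used + cost > self.limit:
            self.windows[key] = (start, used)
            return False  # a gateway would answer 429 Too Many Requests here
        self.windows[key] = (start, used + cost)
        return True
```

With a limit of 100 and a 60-second period, two requests estimated at 60 tokens each from the same subscription would result in the second being rejected until the window renews, which shows why the `calls` value has to be sized as a token budget rather than a request count.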
Regional Load Balancing
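<backend>
<!-- Time-of-day routing (UTC), a simple stand-in for load balancing; the hostnames are placeholders for your regional Azure OpenAI endpoints -->
<choose>
<when condition="@(DateTime.UtcNow.Hour < 12)">
<!-- 00:00 to 11:59 UTC: East US -->
<set-backend-service base-url="https://eastus.openai.azure.com" />
</when>
<when condition="@(DateTime.UtcNow.Hour < 18)">
<!-- 12:00 to 17:59 UTC: West Europe -->
<set-backend-service base-url="https://westeurope.openai.azure.com" />
</when>
<otherwise>
<!-- 18:00 to 23:59 UTC: East US -->
<set-backend-service base-url="https://eastus.openai.azure.com" />
</otherwise>
</choose>
<!-- base invokes forward-request so the request is actually sent -->
<base />
</backend>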
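The branch order matters here: the first matching `when` wins, so the second condition only covers 12:00 to 17:59 UTC. A minimal Python sketch of the same decision, useful for checking the hour boundaries, might look like this. `pick_backend` is a hypothetical helper and the endpoint URLs are the placeholders from the policy.

```python
from datetime import datetime, timezone

# Placeholder endpoints copied from the policy fragment.
EAST_US = "https://eastus.openai.azure.com"
WEST_EUROPE = "https://westeurope.openai.azure.com"


def pick_backend(now=None):
    """Mirror the <choose> branches: the first matching condition wins."""
    if now is None:
        now = datetime.now(timezone.utc)
    hour = now.hour
    if hour < 12:
        return EAST_US
    if hour < 18:
        return WEST_EUROPE
    return EAST_US
```

Note that this is schedule-based routing: it does not react to backend health or throttling, so a region that returns errors keeps receiving all traffic during its time slot.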
Quota by Subscription
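<inbound>
<base />
<!-- Per subscription and per renewal period: calls counts requests, bandwidth is in kilobytes, renewal-period is in seconds (86400 = 24 hours) -->
<quota-by-key calls="10000"
bandwidth="100000000"
renewal-period="86400"
counter-key="@(context.Subscription.Id)" />
</inbound>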
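The raw attribute values are easier to reason about once converted to familiar units. The short Python sketch below does that arithmetic for the numbers in the fragment, assuming `bandwidth` is expressed in kilobytes and `renewal-period` in seconds; `quota_summary` is a hypothetical helper name.

```python
def quota_summary(calls, bandwidth_kb, renewal_seconds):
    """Convert quota-by-key attribute values into more readable figures."""
    return {
        "renewal_hours": renewal_seconds / 3600,
        "avg_calls_per_minute": calls / (renewal_seconds / 60),
        # Decimal gigabytes, taking 1 GB = 1,000,000 KB.
        "bandwidth_gb": bandwidth_kb / 1_000_000,
        "avg_kb_per_call": bandwidth_kb / calls,
    }


summary = quota_summary(calls=10_000, bandwidth_kb=100_000_000, renewal_seconds=86_400)
```

For these values the quota renews every 24 hours, averages just under 7 calls per minute, and allows about 100 GB of traffic, so in practice the call count is likely to be exhausted long before the bandwidth limit.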
Token Usage Tracking
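// Query APIM logs in Log Analytics.
// Column names depend on how the logs are exported (a Details column carrying promptTokens is assumed here); adjust them to your workspace schema.
AzureDiagnostics
| where TimeGenerated >= ago(1h)
| where OperationName == "Inbound"
| where ApiId contains "openai"
| extend promptTokens = toint(parse_json(Details)["promptTokens"])
| summarize totalTokens = sum(promptTokens) by SubscriptionId, bin(TimeGenerated, 1h)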
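The same aggregation can be reproduced offline, for example when reading usage records straight from the Event Hub. The sketch below assumes each record is a dictionary with `timestamp` (ISO 8601 with an explicit offset), `subscriptionId`, and `promptTokens` fields, matching the record shape logged by the outbound policy; `summarize_tokens` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import datetime


def summarize_tokens(records):
    """Sum promptTokens per (subscriptionId, hourly bucket)."""
    totals = defaultdict(int)
    for rec in records:
        ts = datetime.fromisoformat(rec["timestamp"])
        # Truncate to the start of the hour, like bin(TimeGenerated, 1h).
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        key = (rec["subscriptionId"], bucket.isoformat())
        # Treat a missing or null token count as zero.
        totals[key] += int(rec.get("promptTokens") or 0)
    return dict(totals)
```

Keying on the subscription and the hour bucket together gives one row per subscription per hour, which is the shape needed for per-subscription cost reports.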
Cache Responses
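<inbound>
<base />
<!-- cache-lookup runs in inbound and cache-store in outbound. Caveat: chat messages travel in the POST body, not in a "messages" query parameter, and the built-in response cache is designed around GET requests, so verify that this fragment actually produces cache hits for chat completions -->
<cache-lookup vary-by-developer="false" vary-by-developer-groups="false">
<vary-by-query-parameter>messages</vary-by-query-parameter>
</cache-lookup>
</inbound>
<outbound>
<base />
<cache-store duration="3600" />
</outbound>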
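Chat completion requests carry their messages in the POST body, so a cache key for them generally has to be derived from the body rather than from the URL. The Python sketch below shows one way to build such a key, assuming that the deployment name, messages, `temperature`, and `max_tokens` are the fields that determine the response; that field selection and the `chat_cache_key` name are illustrative assumptions, not part of APIM.

```python
import hashlib
import json


def chat_cache_key(deployment, body):
    """Derive a stable cache key from the fields assumed to shape the completion."""
    relevant = {
        "deployment": deployment,
        "messages": body.get("messages", []),
        "temperature": body.get("temperature"),
        "max_tokens": body.get("max_tokens"),
    }
    # Sorted keys and compact separators make the serialization deterministic.
    canonical = json.dumps(relevant, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Exact-match keys like this only help when identical prompts recur, and caching is most defensible for deterministic settings such as a temperature of 0.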
Best Practices
| Practice | Benefit |
|---|---|
| Token-based rate limit | Control actual API usage |
| Regional routing | Reduce latency |
| Usage logging | Track costs per subscription |
| Caching | Reduce API calls |