APIM as AI Gateway — Azure OpenAI Integration

Why Use APIM for AI Services?

As AI becomes central to applications, managing access to services like Azure OpenAI requires careful attention to:

Cost Control - AI API calls can be expensive; need usage tracking and limits
Security - Protect API keys, validate requests, prevent abuse
Performance - Handle high concurrency, reduce latency
Reliability - Fallback strategies, regional redundancy
Observability - Track usage, monitor costs, debug issues

Azure API Management (APIM) can address each of these at the gateway through policies, without changes to client applications or to the model backend.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                    APIM as AI Gateway Architecture                          │
└─────────────────────────────────────────────────────────────────────────────┘

┌────────────────────┐       ┌─────────────────────────────────────────────────────┐
│   Client Apps      │       │              Azure API Management                   │
│                    │       │                                                     │
│  - Web Apps        │       │  ┌─────────────────────────────────────────────┐    │
│  - Mobile Apps     │──────▶│  │  Policies                                   │    │
│  - Backend APIs    │       │  │  - Authentication (Validate JWT)            │    │
│  - Chatbots        │       │  │  - Token Rate Limiting                      │    │
│                    │       │  │  - Prompt Caching                           │    │
│                    │       │  │  - Request Validation                       │    │
└────────────────────┘       │  │  - Response Transformation                  │    │
                             │  │  - Usage Tracking                           │    │
                             │  └─────────────────────────────────────────────┘    │
                             └─────────────────────┬───────────────────────────────┘
                                                   │
                    ┌──────────────────────────────┼───────────────────────────────┐
                    │                              │                               │
                    ▼                              ▼                               ▼
         ┌─────────────────────┐      ┌─────────────────────┐      ┌─────────────────────┐
         │  East US OpenAI     │      │  West Europe OpenAI │      │  Southeast Asia     │
         │  (Primary)          │      │  (Failover)         │      │  (Failover)         │
         └─────────────────────┘      └─────────────────────┘      └─────────────────────┘

Step 1: OpenAI API Backend Configuration

Setting Up the Backend

<!-- backend-policy.xml: point the API at Azure OpenAI and attach the key -->
<inbound>
    <base />
    <!-- Replace YOUR_RESOURCE_NAME with the name of your Azure OpenAI resource -->
    <set-backend-service base-url="https://YOUR_RESOURCE_NAME.openai.azure.com" />
    <!-- Add API key to all requests. {{OPENAI_API_KEY}} is an APIM named value,
         ideally backed by Key Vault. For Azure AD tokens, see the managed identity
         section; send either the key or a bearer token, not both. -->
    <set-header name="api-key" exists-action="override">
        <value>{{OPENAI_API_KEY}}</value>
    </set-header>
</inbound>
<backend>
    <forward-request />
</backend>

// Configure OpenAI backend in Azure through the management API.
// The type names follow the Azure management SDK loosely; adapt them to the SDK version you use.
var backend = new BackendContract
{
    Name = "openai-backend",
    Description = "Azure OpenAI Service",
    Url = "https://my-openai.openai.azure.com",
    // APIM backends use the "http" protocol for both HTTP and HTTPS REST endpoints
    Protocol = BackendProtocol.Http,
    Credentials = new BackendCredentials
    {
        Header = new Dictionary<string, string>
        {
            // Reference a Key Vault-backed named value instead of hard-coding the key
            {"api-key", "{{OPENAI_API_KEY}}"}
        }
    }
};

await client.PutAsync("/backends/openai-backend", backend);

Using Managed Identity (Recommended)

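// For production, use Managed Identity instead of API keys
// 1. Enable a managed identity on the APIM instance
// 2. Grant that identity the "Cognitive Services OpenAI User" role on the OpenAI resource
// 3. Let APIM acquire a token and send it to the backend

<inbound>
    <base />
    <set-backend-service base-url="https://my-resource.openai.azure.com" />
    <!-- Acquires a token for the APIM identity and sets the Authorization header -->
    <authentication-managed-identity resource="https://cognitiveservices.azure.com" />
</inbound>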

Step 2: Token-Based Rate Limiting

Why Token-Based Instead of Request-Based?

Request-Based vs Token-Based Rate Limiting:

Request-Based:
┌─────────────────────────────────────────────────────────────────┐
│ User makes 100 requests per hour - all counted equally          │
│                                                                 │
│ "Summarize this document"     = 1 request (cost: $0.01)         │
│ "Write a novel"               = 1 request (cost: $1.00)         │
│                                                                 │
│ Problem: Easy to abuse with expensive requests                  │
└─────────────────────────────────────────────────────────────────┘

Token-Based:
┌─────────────────────────────────────────────────────────────────┐
│ User has 100,000 tokens per hour budget                         │
│                                                                 │
│ "Summarize this document"     = 500 tokens (cost: $0.01)        │
│ "Write a novel"               = 50,000 tokens (cost: $1.00)     │
│                                                                 │
│ Benefit: Fair allocation based on actual usage                  │
└─────────────────────────────────────────────────────────────────┘
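
To make the difference concrete, the sketch below counts how many requests of a given size fit into one hourly token budget. The class and method names are hypothetical; the numbers used in the note that follows come from the boxes above.

```csharp
using System;

// Hypothetical helper: how many requests of a given size fit into a token budget.
public static class TokenBudget
{
    // Integer division: a request that only partially fits does not count.
    public static long RequestsThatFit(long budgetTokens, long tokensPerRequest)
    {
        if (tokensPerRequest <= 0)
            throw new ArgumentOutOfRangeException(nameof(tokensPerRequest));
        return budgetTokens / tokensPerRequest;
    }
}
```

With a 100,000-token budget, `TokenBudget.RequestsThatFit(100000, 500)` allows 200 summaries but `TokenBudget.RequestsThatFit(100000, 50000)` allows only 2 novels. A limit of 100 requests per hour would instead let one user send 100 novel-sized requests, or 5,000,000 tokens.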

Implementation

<inbound>
    <!-- Step 1: Estimate the prompt token count from the request body -->
    <set-variable name="prompt-tokens" value="@{
        // preserveContent keeps the body readable for later policies and the backend
        var body = context.Request.Body?.As<JObject>(preserveContent: true);
        if (body == null) return 0;

        var messages = body["messages"] as JArray;
        if (messages == null) return 0;

        // Estimate tokens: roughly 1 token per 4 characters of English text.
        // For more accuracy, use an actual tokenizer.
        var content = string.Join(" ", messages.Select(m => m["content"]?.ToString() ?? ""));
        return content.Length / 4;
    }" />

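    <!-- Step 2: Load this caller's usage counter from the APIM cache -->
    <set-variable name="usage-key" value="@{
        // Identify the caller by the "oid" claim of the bearer token
        var jwt = context.Request.Headers.GetValueOrDefault("Authorization", "").AsJwt();
        var userId = jwt?.Claims.GetValueOrDefault("oid", "anonymous") ?? "anonymous";
        return "token_usage_" + userId;
    }" />
    <cache-lookup-value key="@((string)context.Variables["usage-key"])" variable-name="cached-usage" />
    <set-variable name="current-usage" value="@{
        // The counter is cached as a JSON string
        var cached = context.Variables.GetValueOrDefault<string>("cached-usage");
        var usage = cached == null ? null : JObject.Parse(cached);

        // Start a new one-hour window if none exists or the old one has expired
        if (usage == null || DateTime.UtcNow > usage.Value<DateTime>("window_start").AddHours(1))
        {
            usage = new JObject
            {
                ["used"] = 0,
                ["window_start"] = DateTime.UtcNow,
                ["reset_at"] = DateTime.UtcNow.AddHours(1)
            };
        }

        return usage;
    }" />

    <!-- Step 3: Reject the request if it would exceed the budget; otherwise record it -->
    <choose>
        <when condition="@{
            var usage = context.Variables.GetValueOrDefault<JObject>("current-usage");
            var promptTokens = context.Variables.GetValueOrDefault<int>("prompt-tokens");
            var limit = 100000L; // 100k tokens per hour

            return (usage.Value<long>("used") + promptTokens) > limit;
        }">
            <!-- Rate limit exceeded -->
            <return-response>
                <set-status code="429" reason="Too Many Requests" />
                <set-body>@{
                    var usage = context.Variables.GetValueOrDefault<JObject>("current-usage");
                    return new JObject
                    {
                        ["error"] = "Token limit exceeded",
                        ["message"] = "You have used " + usage.Value<long>("used") + " tokens this hour. Limit: 100000",
                        ["retry_after"] = "3600"
                    }.ToString();
                }</set-body>
            </return-response>
        </when>
        <otherwise>
            <!-- Count the estimated prompt tokens and persist the counter -->
            <cache-store-value key="@((string)context.Variables["usage-key"])" value="@{
                var usage = context.Variables.GetValueOrDefault<JObject>("current-usage");
                usage["used"] = usage.Value<long>("used") + context.Variables.GetValueOrDefault<int>("prompt-tokens");
                return usage.ToString();
            }" duration="3600" />
        </otherwise>
    </choose>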

    <!-- Step 4: Pass the caller's usage so far to the backend in a request header -->
    <set-header name="X-Token-Usage" exists-action="override">
        <value>@{
            var usage = context.Variables.GetValueOrDefault<JObject>("current-usage");
            return usage.Value<long>("used").ToString();
        }</value>
    </set-header>

    <base />
</inbound>

C# Implementation for Token Tracking

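using System;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

// ICacheClient is assumed to be an application-defined cache abstraction
// (for example, a thin wrapper over a distributed cache)
public class TokenRateLimitingService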
{
    private readonly ICacheClient _cache;
    private readonly ILogger<TokenRateLimitingService> _logger;

    public TokenRateLimitingService(
        ICacheClient cache,
        ILogger<TokenRateLimitingService> logger)
    {
        _cache = cache;
        _logger = logger;
    }

    public async Task<RateLimitResult> CheckAndUpdateUsageAsync(
        string userId, 
        long tokenCount,
        long maxTokensPerHour = 100000)
    {
        var cacheKey = $"token_usage_{userId}";
        
        // Get current usage; start a fresh window if nothing is cached or it cannot be parsed
        var usageJson = await _cache.GetStringAsync(cacheKey);
        var usage = (string.IsNullOrEmpty(usageJson)
                ? null
                : JsonSerializer.Deserialize<TokenUsage>(usageJson))
            ?? new TokenUsage { ResetAt = DateTime.UtcNow.AddHours(1) };

        // Check if window expired
        if (DateTime.UtcNow >= usage.ResetAt)
        {
            usage = new TokenUsage
            {
                Used = 0,
                ResetAt = DateTime.UtcNow.AddHours(1)
            };
        }

        // Check limit
        if (usage.Used + tokenCount > maxTokensPerHour)
        {
            var remaining = maxTokensPerHour - usage.Used;
            
            _logger.LogWarning(
                "User {UserId} exceeded token limit. Used: {Used}, Requested: {Requested}, Limit: {Limit}",
                userId, usage.Used, tokenCount, maxTokensPerHour);

            return new RateLimitResult
            {
                Allowed = false,
                RemainingTokens = remaining,
                ResetAt = usage.ResetAt,
                RetryAfter = (usage.ResetAt - DateTime.UtcNow).TotalSeconds
            };
        }

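        // Update usage. The entry expires when the window ends, not an hour after the last write.
        // Note: this read-modify-write is not atomic, so concurrent requests can undercount.
        usage.Used += tokenCount;
        await _cache.SetStringAsync(cacheKey, JsonSerializer.Serialize(usage),
            usage.ResetAt - DateTime.UtcNow);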

        _logger.LogInformation(
            "User {UserId} used {Tokens} tokens. Total: {Used}, Limit: {Limit}",
            userId, tokenCount, usage.Used, maxTokensPerHour);

        return new RateLimitResult
        {
            Allowed = true,
            RemainingTokens = maxTokensPerHour - usage.Used,
            ResetAt = usage.ResetAt
        };
    }
}

public class TokenUsage
{
    public long Used { get; set; }
    public DateTime ResetAt { get; set; }
}

public class RateLimitResult
{
    public bool Allowed { get; set; }
    public long RemainingTokens { get; set; }
    public DateTime ResetAt { get; set; }
    public double RetryAfter { get; set; }
}

Step 3: Prompt Caching Strategy

Why Cache Prompts?

Cost Comparison - Caching:

Without Cache:
┌────────────────────────────────────────────────────────────┐
│ 1,000 users request "What is Azure?" at 9 AM               │
│                                                            │
│ 1,000 × 10 tokens × $0.001/token = $10.00 in total         │
│                                                            │
│ Wasteful! Everyone gets the same answer                    │
└────────────────────────────────────────────────────────────┘

With Cache:
┌────────────────────────────────────────────────────────────┐
│ First request:  Compute answer (10 tokens)  → $0.01        │
│ Next 999 requests: Return cached answer (0 tokens) → $0    │
│                                                            │
│ Total: $0.01 instead of $10.00 - 99.9% savings!            │
└────────────────────────────────────────────────────────────┘
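
The arithmetic in the two boxes can be reproduced with a few lines of code. This is a minimal sketch with hypothetical names; the price of $0.001 per token is the illustrative figure from the boxes, not a real Azure rate.

```csharp
using System;

// Hypothetical helper reproducing the cost comparison above.
public static class CacheSavings
{
    // Cost when every request reaches the model.
    public static decimal WithoutCache(int requests, int tokensPerRequest, decimal pricePerToken)
        => requests * tokensPerRequest * pricePerToken;

    // Cost when only the first of the identical requests reaches the model.
    public static decimal WithCache(int requests, int tokensPerRequest, decimal pricePerToken)
        => requests > 0 ? tokensPerRequest * pricePerToken : 0m;

    // Fraction of the uncached cost that caching avoids (between 0 and 1).
    public static decimal SavingsRatio(int requests, int tokensPerRequest, decimal pricePerToken)
    {
        var baseline = WithoutCache(requests, tokensPerRequest, pricePerToken);
        if (baseline == 0m) return 0m;
        return 1m - WithCache(requests, tokensPerRequest, pricePerToken) / baseline;
    }
}
```

For 1,000 requests of 10 tokens at $0.001 per token, `WithoutCache` gives 10.00, `WithCache` gives 0.01, and `SavingsRatio` gives 0.999. The ratio only holds when all requests are identical and arrive while the cached entry is still valid.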

Caching Policy

<inbound>
    <!-- Generate cache key based on prompt content -->
    <set-variable name="cache-key" value="@{
        // preserveContent keeps the body readable for the backend
        var body = context.Request.Body?.As<JObject>(preserveContent: true);
        if (body == null) return null;

        var messages = body["messages"] as JArray;
        if (messages == null) return null;

        // Create a deterministic cache key from the role and content of each message
        var cacheContent = string.Join("|", messages.Select(m =>
            (m["role"]?.ToString() ?? "") + ":" + (m["content"]?.ToString() ?? "")));

        // Hash the content so the key has a fixed length
        using (var sha = SHA256.Create())
        {
            var hash = sha.ComputeHash(Encoding.UTF8.GetBytes(cacheContent));
            return Convert.ToBase64String(hash).Substring(0, 32);
        }
    }" />

    <!-- Try to get cached response -->
    <cache-lookup-value key="@((string)context.Variables["cache-key"])" variable-name="cached-response" />

    <choose>
        <when condition="@(context.Variables.GetValueOrDefault<string>("cached-response") != null)">
            <!-- Return cached response -->
            <return-response>
                <set-status code="200" reason="OK" />
                <set-header name="Content-Type" exists-action="override">
                    <value>application/json</value>
                </set-header>
                <set-header name="X-Cache" exists-action="override">
                    <value>HIT</value>
                </set-header>
                <set-body>@(context.Variables.GetValueOrDefault<string>("cached-response"))</set-body>
            </return-response>
        </when>
    </choose>

    <base />
</inbound>

<outbound>
    <!-- Cache successful responses -->
    <choose>
        <when condition="@(context.Response.StatusCode == 200)">
            <!-- Read the body as a string; preserveContent keeps it available for the client -->
            <set-variable name="response-body" value="@(context.Response.Body.As<string>(preserveContent: true))" />

            <!-- Only cache if response contains actual content -->
            <choose>
                <when condition="@((JObject.Parse(context.Variables.GetValueOrDefault<string>("response-body"))["choices"] as JArray)?.Count > 0)">
                    <cache-store-value key="@((string)context.Variables["cache-key"])"
                                       value="@(context.Variables.GetValueOrDefault<string>("response-body"))"
                                       duration="3600" />

                    <set-header name="X-Cache" exists-action="override">
                        <value>MISS</value>
                    </set-header>
                </when>
            </choose>
        </when>
    </choose>
</outbound>
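
The key derivation in the inbound policy can also be exercised outside APIM, which makes it easier to unit test. The sketch below mirrors the policy's role and content layout; the class name `PromptCacheKey` and its tuple-based signature are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Hypothetical offline version of the cache-key expression, useful for unit tests.
public static class PromptCacheKey
{
    // messages: (role, content) pairs in conversation order.
    public static string Compute(IEnumerable<(string Role, string Content)> messages)
    {
        // Same "role:content|role:content" layout as the policy expression
        var canonical = string.Join("|",
            messages.Select(m => (m.Role ?? "") + ":" + (m.Content ?? "")));

        using (var sha = SHA256.Create())
        {
            var hash = sha.ComputeHash(Encoding.UTF8.GetBytes(canonical));
            // A 32-byte hash encodes to 44 Base64 characters; keep the first 32
            return Convert.ToBase64String(hash).Substring(0, 32);
        }
    }
}
```

Identical conversations always map to the same key, and any change to a role or to the text produces a different one. The key does not include the model or sampling parameters, so requests that differ only in those fields share a cache entry.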

Smart Caching - Identify Cacheable Requests

using System;
using System.Linq;

public class CacheableRequestDetector
{
    // Determine if a request should be cached
    public bool ShouldCache(ChatCompletionRequest request)
    {
        // Don't cache if:

        // 1. Temperature is high: sampling makes each response different
        if (request.Temperature > 0.5m)
            return false;

        // 2. The system message still contains unresolved template placeholders
        if (request.Messages.Any(m => m.Role == "system" &&
            m.Content?.Contains("{{") == true))
            return false;

        // 3. Is a streaming request (can't cache partial responses)
        if (request.Stream == true)
            return false;

        // 4. Contains user-specific context (a crude, case-insensitive keyword check)
        if (request.Messages.Any(m =>
            m.Content?.Contains("my ", StringComparison.OrdinalIgnoreCase) == true ||
            m.Content?.Contains("I am ", StringComparison.OrdinalIgnoreCase) == true))
            return false;

        // Cache if: static prompts, documentation lookups, FAQs
        return true;
    }

    // Determine cache duration based on content type
    public TimeSpan GetCacheDuration(ChatCompletionRequest request)
    {
        var firstMessage = request.Messages.FirstOrDefault()?.Content?.ToLowerInvariant() ?? "";

        if (firstMessage.Contains("documentation") || 
            firstMessage.Contains("help") ||
            firstMessage.Contains("faq"))
        {
            // Documentation can be cached longer
            return TimeSpan.FromHours(24);
        }

        if (firstMessage.Contains("what is") || 
            firstMessage.Contains("define"))
        {
            // General knowledge questions - medium cache
            return TimeSpan.FromHours(4);
        }

        // Default: shorter cache
        return TimeSpan.FromHours(1);
    }
}

Step 4: Regional Load Balancing

Multi-Region Setup

<inbound>
    <!-- Select backend based on user's region -->
    <set-variable name="target-region" value="@{
        var region = context.Request.Headers.GetValueOrDefault("X-User-Region", "auto");

        // If explicitly specified, use that region
        if (region != "auto") return region;

        // Otherwise fall back to the X-Forwarded-Region header, if an upstream proxy sets it
        var forwardedRegion = context.Request.Headers.GetValueOrDefault("X-Forwarded-Region", "");
        if (!string.IsNullOrEmpty(forwardedRegion)) return forwardedRegion;

        // Default to primary region
        return "eastus";
    }" />

    <!-- Route to appropriate backend -->
    <set-backend-service backend-id="@{
        var region = context.Variables.GetValueOrDefault<string>("target-region");

        // A switch statement is used because policy expressions may not accept
        // newer C# syntax such as switch expressions
        switch (region)
        {
            case "westeurope": return "backend-westeurope";
            case "southeastasia": return "backend-southeastasia";
            default: return "backend-eastus";  // Includes "eastus"
        }
    }" />

    <base />
</inbound>

Health-Based Routing

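using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.Logging;

public class OpenAIBackendManager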
{
    private readonly List<OpenAIBackend> _backends;
    private readonly ILogger<OpenAIBackendManager> _logger;

    public OpenAIBackendManager(IConfiguration configuration, ILogger<OpenAIBackendManager> logger)
    {
        _logger = logger;
        
        // Configure your available backends
        _backends = new List<OpenAIBackend>
        {
            new() { 
                Name = "eastus-primary", 
                Endpoint = "https://eastus-openai.openai.azure.com",
                Region = "East US",
                IsPrimary = true 
            },
            new() { 
                Name = "westeurope-failover", 
                Endpoint = "https://westeurope-openai.openai.azure.com",
                Region = "West Europe",
                IsPrimary = false 
            },
            new() { 
                Name = "southeastasia-failover", 
                Endpoint = "https://southeastasia-openai.openai.azure.com",
                Region = "Southeast Asia",
                IsPrimary = false 
            }
        };
    }

    public async Task<BackendSelectionResult> SelectBackendAsync()
    {
        // Try primary first (First throws early if no backend is marked as primary)
        var primary = _backends.First(b => b.IsPrimary);
        
        if (await IsHealthyAsync(primary))
        {
            _logger.LogInformation("Using primary backend: {Name}", primary.Name);
            return new BackendSelectionResult { Backend = primary, Reason = "Primary healthy" };
        }

        // Try fallbacks
        foreach (var fallback in _backends.Where(b => !b.IsPrimary))
        {
            if (await IsHealthyAsync(fallback))
            {
                _logger.LogWarning("Primary failed, using fallback: {Name}", fallback.Name);
                return new BackendSelectionResult { Backend = fallback, Reason = "Primary unhealthy" };
            }
        }

        // All backends unhealthy - return primary anyway (better than nothing)
        _logger.LogError("All backends unhealthy, using primary");
        return new BackendSelectionResult { Backend = primary, Reason = "All unhealthy" };
    }

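    // Reuse one HttpClient; creating a client per check can exhaust sockets under load
    private static readonly HttpClient HealthClient = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(5)
    };

    private async Task<bool> IsHealthyAsync(OpenAIBackend backend)
    {
        try
        {
            // Simple health check: call the models endpoint.
            // The request needs an api-key or bearer token; without one the service
            // answers 401 and the backend is reported as unhealthy.
            var uri = new Uri(new Uri(backend.Endpoint), "/openai/models?api-version=2023-05-15");
            var response = await HealthClient.GetAsync(uri);
            return response.IsSuccessStatusCode;
        }
        catch (Exception ex)
        {
            _logger.LogWarning(ex, "Health check failed for {Name}", backend.Name);
            return false;
        }
    }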
}

public class OpenAIBackend
{
    public string Name { get; set; }
    public string Endpoint { get; set; }
    public string Region { get; set; }
    public bool IsPrimary { get; set; }
}

public class BackendSelectionResult
{
    public OpenAIBackend Backend { get; set; }
    public string Reason { get; set; }
}

Step 5: Request/Response Transformation

Standardize API Surface

<inbound>
    <!-- Transform request to OpenAI format -->
    <set-body>@{
        var request = context.Request.Body.As<JObject>();

        // Our API uses 'prompt', convert to OpenAI 'messages' format
        if (request["prompt"] != null && request["messages"] == null)
        {
            var prompt = request["prompt"].ToString();
            request["messages"] = new JArray
            {
                new JObject
                {
                    ["role"] = "user",
                    ["content"] = prompt
                }
            };
            request.Remove("prompt");
        }

        // Add default parameters if not specified
        if (request["temperature"] == null)
            request["temperature"] = 0.7;

        if (request["max_tokens"] == null)
            request["max_tokens"] = 1000;

        return request.ToString();
    }</set-body>
</inbound>

<outbound>
    <!-- Transform response to our format -->
    <set-body>@{
        var response = context.Response.Body.As<JObject>();

        // Our API returns simpler format
        var result = new JObject
        {
            ["id"] = response["id"],
            ["created"] = response["created"],
            ["answer"] = response["choices"]?[0]?["message"]?["content"],
            ["usage"] = response["usage"],
            ["model"] = response["model"]
        };

        return result.ToString();
    }</set-body>
</outbound>
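
The inbound transformation is easier to test when the same logic exists as plain code. The following is a minimal sketch using `System.Text.Json.Nodes` (.NET 6 or later); `RequestNormalizer` is a hypothetical name, and it is an offline equivalent of the policy rather than something APIM runs.

```csharp
using System.Text.Json.Nodes;

// Hypothetical offline equivalent of the inbound transformation, for unit testing.
public static class RequestNormalizer
{
    public static JsonObject ToChatCompletionRequest(JsonObject request)
    {
        // Convert the simplified 'prompt' field into the 'messages' array
        if (request["prompt"] is JsonNode prompt && request["messages"] is null)
        {
            var text = prompt.GetValue<string>();
            request.Remove("prompt");
            request["messages"] = new JsonArray(
                new JsonObject { ["role"] = "user", ["content"] = text });
        }

        // Apply the same defaults as the policy when the caller omits them
        if (request["temperature"] is null)
            request["temperature"] = 0.7;
        if (request["max_tokens"] is null)
            request["max_tokens"] = 1000;

        return request;
    }
}
```

Passing the parsed body `{"prompt":"Hello"}` yields an object with a single user message, `temperature` set to 0.7, and `max_tokens` set to 1000. A request that already contains `messages` is left unchanged apart from the defaults.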

Step 6: Cost Tracking and Analytics

Usage Tracking Implementation

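using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Data.Tables;
using Microsoft.Extensions.Logging;

public class OpenAIUsageTracker
{
    private readonly TableClient _tableClient;
    private readonly ILogger<OpenAIUsageTracker> _logger;

    public OpenAIUsageTracker(TableClient tableClient, ILogger<OpenAIUsageTracker> logger)
    {
        _tableClient = tableClient;
        _logger = logger;
    }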

    public async Task RecordUsageAsync(UsageRecord record)
    {
        try
        {
            // Store in Azure Table
            await _tableClient.AddEntityAsync(new TableEntity
            {
                PartitionKey = record.Date.ToString("yyyy-MM-dd"),
                RowKey = Guid.NewGuid().ToString(),
                ["UserId"] = record.UserId,
                ["PromptTokens"] = record.PromptTokens,
                ["CompletionTokens"] = record.CompletionTokens,
                ["TotalTokens"] = record.TotalTokens,
                ["Cost"] = (double)record.Cost,  // Table storage has no decimal type
                ["Model"] = record.Model,
                ["Endpoint"] = record.Endpoint,
                ["Timestamp"] = DateTime.UtcNow
            });

            _logger.LogDebug("Recorded usage for user {UserId}: {Tokens} tokens (${Cost})",
                record.UserId, record.TotalTokens, record.Cost);
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Failed to record usage");
        }
    }

    public async Task<CostSummary> GetCostSummaryAsync(
        string userId, 
        DateTime from, 
        DateTime to)
    {
        // Query and aggregate usage
        var query = $"PartitionKey ge '{from:yyyy-MM-dd}' and PartitionKey le '{to:yyyy-MM-dd}'";

        // If user specified, filter (escape single quotes so the value cannot break the filter)
        if (!string.IsNullOrEmpty(userId))
        {
            var safeUserId = userId.Replace("'", "''");
            query += $" and UserId eq '{safeUserId}'";
        }

        var results = new List<UsageRecord>();

        await foreach (var entity in _tableClient.QueryAsync<TableEntity>(query))
        {
            results.Add(new UsageRecord
            {
                UserId = entity.GetString("UserId"),
                PromptTokens = entity.GetInt32("PromptTokens") ?? 0,
                CompletionTokens = entity.GetInt32("CompletionTokens") ?? 0,
                // TotalTokens is computed from the two counts above
                Cost = (decimal)(entity.GetDouble("Cost") ?? 0)
            });
        }

        return new CostSummary
        {
            TotalCost = results.Sum(r => r.Cost),
            TotalTokens = results.Sum(r => r.TotalTokens),
            RequestCount = results.Count,
            AverageCostPerRequest = results.Any() 
                ? results.Average(r => r.Cost) 
                : 0
        };
    }
}

// Pricing (example rates - check Azure pricing)
public static class OpenAIPricing
{
    // Prices per 1K tokens (example)
    public const decimal GPT4_8K_Input = 0.03m;
    public const decimal GPT4_8K_Output = 0.06m;
    public const decimal GPT35_4K_Input = 0.001m;
    public const decimal GPT35_4K_Output = 0.002m;

    public static decimal CalculateCost(
        string model,
        int promptTokens,
        int completionTokens)
    {
        // Determine pricing tier
        var (inputPrice, outputPrice) = model.ToLower() switch
        {
            var m when m.Contains("gpt-4") => (GPT4_8K_Input, GPT4_8K_Output),
            _ => (GPT35_4K_Input, GPT35_4K_Output)
        };

        var inputCost = (promptTokens / 1000m) * inputPrice;
        var outputCost = (completionTokens / 1000m) * outputPrice;

        return inputCost + outputCost;
    }
}

public class UsageRecord
{
    public string UserId { get; set; }
    public DateTime Date { get; set; }
    public int PromptTokens { get; set; }
    public int CompletionTokens { get; set; }
    public int TotalTokens => PromptTokens + CompletionTokens;
    public decimal Cost { get; set; }
    public string Model { get; set; }
    public string Endpoint { get; set; }
}

public class CostSummary
{
    public decimal TotalCost { get; set; }
    public int TotalTokens { get; set; }
    public int RequestCount { get; set; }
    public decimal AverageCostPerRequest { get; set; }
}
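
As a worked example of the pricing formula, the sketch below computes the cost of a single GPT-4 call. It uses the same illustrative rates as `OpenAIPricing` above (0.03 per 1K input tokens, 0.06 per 1K output tokens), which are not current Azure prices; the class name `PricingExample` is hypothetical.

```csharp
using System;

// Worked example with the illustrative GPT-4 rates from OpenAIPricing above.
public static class PricingExample
{
    public static decimal Gpt4Cost(int promptTokens, int completionTokens)
    {
        const decimal inputPer1K = 0.03m;   // example rate, per 1K prompt tokens
        const decimal outputPer1K = 0.06m;  // example rate, per 1K completion tokens

        // Dividing by 1000m keeps the arithmetic in decimal and avoids integer truncation
        return promptTokens / 1000m * inputPer1K
             + completionTokens / 1000m * outputPer1K;
    }
}
```

A call with 1,500 prompt tokens and 500 completion tokens costs 1.5 × 0.03 + 0.5 × 0.06 = 0.075. Output tokens are billed at twice the input rate in this example, so long completions dominate the cost.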

Step 7: Complete Policy Example

<policies>
    <inbound>
        <!-- 1. Authentication -->
        <validate-jwt header-name="Authorization" failed-validation-error-message="Unauthorized">
            <openid-config url="https://login.microsoftonline.com/{tenant}/v2.0/.well-known/openid-configuration" />
            <audiences>
                <audience>api://your-app-id</audience>
            </audiences>
        </validate-jwt>

        <!-- 2. Estimate prompt tokens -->
        <set-variable name="prompt-tokens" value="@{
            // preserveContent keeps the body readable for later policies and the backend
            var body = context.Request.Body?.As<JObject>(preserveContent: true);
            if (body == null) return 0;

            var messages = body["messages"] as JArray;
            if (messages == null) return 0;

            var content = string.Join(" ", messages.Select(m => m["content"]?.ToString() ?? ""));
            return content.Length / 4;  // Rough estimate
        }" />

        <!-- 3. Check rate limit -->
        <set-variable name="rate-limit-check" value="@{
            var jwt = context.Request.Headers.GetValueOrDefault("Authorization", "").AsJwt();
            var userId = jwt?.Claims.GetValueOrDefault("oid", "anonymous") ?? "anonymous";
            // Simplified: apply the token budget check from Step 2 for userId here
            return true;
        }" />

        <choose>
            <when condition="@(!context.Variables.GetValueOrDefault<bool>("rate-limit-check"))">
                <return-response>
                    <set-status code="429" reason="Too Many Requests" />
                    <set-body>{"error": "Token rate limit exceeded"}</set-body>
                </return-response>
            </when>
        </choose>

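        <!-- 4. Try cache lookup -->
        <set-variable name="cache-key" value="@{
            // Same key generation as in Step 3: hash the role and content of each message
            var body = context.Request.Body?.As<JObject>(preserveContent: true);
            var messages = body?["messages"] as JArray;
            if (messages == null) return null;
            var cacheContent = string.Join("|", messages.Select(m =>
                (m["role"]?.ToString() ?? "") + ":" + (m["content"]?.ToString() ?? "")));
            using (var sha = SHA256.Create())
            {
                return Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(cacheContent))).Substring(0, 32);
            }
        }" />
        <cache-lookup-value key="@((string)context.Variables["cache-key"])" variable-name="cached-response" />

        <choose>
            <when condition="@(context.Variables.GetValueOrDefault<string>("cached-response") != null)">
                <return-response>
                    <set-status code="200" reason="OK" />
                    <set-header name="X-Cache" exists-action="override"><value>HIT</value></set-header>
                    <set-body>@(context.Variables.GetValueOrDefault<string>("cached-response"))</set-body>
                </return-response>
            </when>
        </choose>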

        <!-- 5. Route to backend -->
        <set-backend-service backend-id="backend-openai" />

        <base />
    </inbound>

    <backend>
        <base />
    </backend>

    <outbound>
        <!-- 6. Cache successful responses -->
        <choose>
            <when condition="@(context.Response.StatusCode == 200)">
                <!-- Store the body as a string; preserveContent keeps it available for the client -->
                <cache-store-value key="@((string)context.Variables["cache-key"])"
                                   value="@(context.Response.Body.As<string>(preserveContent: true))"
                                   duration="3600" />
                <set-header name="X-Cache" exists-action="override"><value>MISS</value></set-header>
            </when>
        </choose>

        <!-- 7. Add usage tracking (static placeholder values; derive them from the usage counter in practice) -->
        <set-header name="X-Rate-Limit-Limit" exists-action="override"><value>100000</value></set-header>
        <set-header name="X-Rate-Limit-Remaining" exists-action="override"><value>90000</value></set-header>

        <base />
    </outbound>
</policies>

Best Practices Summary

Practice	Why	Implementation
Token-based rate limiting	Fair cost distribution	Track token usage, not just requests
Prompt caching	Reduce costs	Cache static/deterministic prompts
Regional routing	Low latency + redundancy	Route to nearest healthy region
Managed Identity	Better security	No API keys to manage
Usage tracking	Cost visibility	Log all requests with tokens
Response transformation	API consistency	Standardize across backends

Monitoring Dashboard

// Azure Monitor (Application Insights) queries for OpenAI usage.
// They assume the gateway logs PromptTokens, Cost, and CacheHit as custom dimensions.

// Total prompt tokens by hour
requests
| where url contains "openai" and operation_Name startswith "POST"
| extend promptTokens = tolong(customDimensions.PromptTokens)
| summarize sum(promptTokens) by bin(timestamp, 1h)

// Cost by user
requests
| where url contains "openai"
| extend cost = todouble(customDimensions.Cost)
| summarize sum(cost) by user_Id

// Cache hit rate
requests
| where url contains "openai"
| extend cacheStatus = tostring(customDimensions.CacheHit)
| summarize count() by cacheStatus
| render piechart

Conclusion

Using APIM as an AI gateway provides:

Cost Control - Token-based rate limiting prevents runaway costs
Performance - Caching reduces latency and API calls
Reliability - Multi-region routing with health checks
Security - Centralized authentication and validation
Observability - Complete usage tracking and analytics

Key takeaways:

Implement token-based limits, not request-based
Cache deterministic prompts for huge savings
Use regional routing for global applications
Track everything - you can't optimize what you don't measure

Azure Integration Hub - API Management