APIM as AI Gateway — Azure OpenAI Integration
Why Use APIM for AI Services?
As AI becomes central to applications, managing access to services like Azure OpenAI requires careful attention to:
- Cost Control - AI API calls can be expensive; need usage tracking and limits
- Security - Protect API keys, validate requests, prevent abuse
- Performance - Handle high concurrency, reduce latency
- Reliability - Fallback strategies, regional redundancy
- Observability - Track usage, monitor costs, debug issues
Azure API Management (APIM) can address each of these at the gateway through policies, without changes to client applications.
Architecture Overview
┌──────────────────────────────────────────────────────────────────────────────┐
│                       APIM as AI Gateway Architecture                        │
└──────────────────────────────────────────────────────────────────────────────┘
┌────────────────────┐       ┌─────────────────────────────────────────────────┐
│    Client Apps     │       │              Azure API Management               │
│                    │       │                                                 │
│  - Web Apps        │       │  ┌───────────────────────────────────────────┐  │
│  - Mobile Apps     │──────▶│  │ Policies                                  │  │
│  - Backend APIs    │       │  │ - Authentication (Validate JWT)           │  │
│  - Chatbots        │       │  │ - Token Rate Limiting                     │  │
│                    │       │  │ - Prompt Caching                          │  │
│                    │       │  │ - Request Validation                      │  │
└────────────────────┘       │  │ - Response Transformation                 │  │
                             │  │ - Usage Tracking                          │  │
                             │  └───────────────────────────────────────────┘  │
                             └────────────────────────┬────────────────────────┘
                                                      │
                            ┌─────────────────────────┼─────────────────────────┐
                            │                         │                         │
                            ▼                         ▼                         ▼
                 ┌─────────────────────┐   ┌─────────────────────┐   ┌─────────────────────┐
                 │   East US OpenAI    │   │ West Europe OpenAI  │   │   Southeast Asia    │
                 │      (Primary)      │   │     (Failover)      │   │     (Failover)      │
                 └─────────────────────┘   └─────────────────────┘   └─────────────────────┘
Step 1: OpenAI API Backend Configuration
Setting Up the Backend
<!-- backend-policy.xml: inbound fragment that points the API at Azure OpenAI -->
<inbound>
    <base />
    <set-backend-service base-url="https://YOUR_RESOURCE_NAME.openai.azure.com" />
    <!-- Key-based auth: the key comes from an APIM named value, ideally Key Vault-backed -->
    <set-header name="api-key" exists-action="override">
        <value>{{OPENAI_API_KEY}}</value>
    </set-header>
    <!-- Send either the api-key header or a Microsoft Entra ID bearer token, not both.
         For the token option, see "Using Managed Identity" in this step. -->
</inbound>
// Configure the OpenAI backend through the management SDK.
// Sketch assuming the Microsoft.Azure.Management.ApiManagement package;
// resourceGroupName and serviceName are placeholders for your APIM instance.
var backend = new BackendContract
{
    Url = "https://my-openai.openai.azure.com",
    Protocol = BackendProtocol.Http,
    Description = "Azure OpenAI Service",
    Credentials = new BackendCredentialsContract
    {
        Header = new Dictionary<string, IList<string>>
        {
            // Reference a Key Vault-backed named value; never hard-code the key
            { "api-key", new List<string> { "{{OPENAI_API_KEY}}" } }
        }
    }
};
await client.Backend.CreateOrUpdateAsync(
    resourceGroupName, serviceName, "openai-backend", backend);
Using Managed Identity (Recommended)
<!-- For production, use managed identity instead of API keys:
     1. Enable a managed identity on the APIM instance
     2. Grant that identity the "Cognitive Services OpenAI User" role on the OpenAI resource
     3. Let APIM acquire the token and attach it as the Authorization header -->
<inbound>
    <base />
    <set-backend-service base-url="https://my-resource.openai.azure.com" />
    <authentication-managed-identity resource="https://cognitiveservices.azure.com" />
</inbound>
Step 2: Token-Based Rate Limiting
Why Token-Based Instead of Request-Based?
Request-Based vs Token-Based Rate Limiting:
Request-Based:
┌─────────────────────────────────────────────────────────────────┐
│ User makes 100 requests per hour - all counted equally │
│ │
│ "Summarize this document" = 1 request (cost: $0.01) │
│ "Write a novel" = 1 request (cost: $1.00) │
│ │
│ Problem: Easy to abuse with expensive requests │
└─────────────────────────────────────────────────────────────────┘
Token-Based:
┌─────────────────────────────────────────────────────────────────┐
│ User has 100,000 tokens per hour budget │
│ │
│ "Summarize this document" = 500 tokens (cost: $0.01) │
│ "Write a novel" = 50,000 tokens (cost: $1.00) │
│ │
│ Benefit: Fair allocation based on actual usage │
└─────────────────────────────────────────────────────────────────┘
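The two accounting models compared above can be sketched in a few lines of C#. This is an illustration only: `BudgetComparison` and its method names are hypothetical and are not part of APIM or any SDK.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class BudgetComparison
{
    // Request-based: every call counts as one unit, whatever its size.
    public static bool AllowedByRequestLimit(IReadOnlyCollection<long> tokensPerCall, int maxRequests)
        => tokensPerCall.Count <= maxRequests;

    // Token-based: each call is charged the tokens it consumes.
    public static bool AllowedByTokenBudget(IReadOnlyCollection<long> tokensPerCall, long maxTokens)
        => tokensPerCall.Sum() <= maxTokens;
}
```

Three 50,000-token calls pass a 100-requests-per-hour limit easily, but they exceed a 100,000-token budget, which is the abuse case that request counting misses.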
Implementation
<inbound>
    <!-- Step 1: Estimate prompt tokens from the request body -->
    <set-variable name="prompt-tokens" value="@{
        // preserveContent keeps the body readable for the backend call
        var body = context.Request.Body?.As<JObject>(preserveContent: true);
        var messages = body?["messages"] as JArray;
        if (messages == null) return 0L;
        // Estimate tokens: roughly 1 token per 4 characters.
        // For more accuracy, use an actual tokenizer.
        var content = string.Join(" ", messages.Select(m => m["content"]?.ToString() ?? ""));
        return (long)(content.Length / 4);
    }" />
    <!-- Step 2: Load the caller's usage counter from the APIM cache -->
    <set-variable name="usage-key" value="@{
        // Key the counter on the object ID claim of the caller's bearer token
        var jwt = context.Request.Headers.GetValueOrDefault("Authorization", "").Replace("Bearer ", "").AsJwt();
        return "token_usage_" + (jwt?.Claims.GetValueOrDefault("oid", "anonymous") ?? "anonymous");
    }" />
    <cache-lookup-value key="@((string)context.Variables["usage-key"])" variable-name="cached-usage" />
    <set-variable name="current-usage" value="@{
        var cached = context.Variables.GetValueOrDefault<string>("cached-usage");
        var usage = string.IsNullOrEmpty(cached) ? null : JObject.Parse(cached);
        // Start a fresh one-hour window on a cache miss or after the old window ends
        if (usage == null || DateTime.UtcNow >= usage.Value<DateTime>("reset_at"))
        {
            usage = new JObject
            {
                ["used"] = 0L,
                ["reset_at"] = DateTime.UtcNow.AddHours(1)
            };
        }
        return usage;
    }" />
    <!-- Step 3: Reject the request if it would exceed the hourly budget -->
    <choose>
        <when condition="@{
            var usage = (JObject)context.Variables["current-usage"];
            var limit = 100000L; // 100k tokens per hour
            return usage.Value<long>("used") + (long)context.Variables["prompt-tokens"] > limit;
        }">
            <!-- Rate limit exceeded -->
            <return-response>
                <set-status code="429" reason="Too Many Requests" />
                <set-header name="Retry-After" exists-action="override">
                    <value>@{
                        var usage = (JObject)context.Variables["current-usage"];
                        var wait = usage.Value<DateTime>("reset_at") - DateTime.UtcNow;
                        return Math.Max(1, (int)wait.TotalSeconds).ToString();
                    }</value>
                </set-header>
                <set-body>@{
                    var used = ((JObject)context.Variables["current-usage"]).Value<long>("used");
                    return new JObject
                    {
                        ["error"] = "Token limit exceeded",
                        ["message"] = "You have used " + used + " tokens this hour. Limit: 100000"
                    }.ToString();
                }</set-body>
            </return-response>
        </when>
        <otherwise>
            <!-- Count this prompt against the budget. Lookup-then-store is not atomic,
                 so concurrent requests can undercount; the built-in
                 azure-openai-token-limit policy avoids this. -->
            <cache-store-value key="@((string)context.Variables["usage-key"])" value="@{
                var usage = (JObject)context.Variables["current-usage"];
                usage["used"] = usage.Value<long>("used") + (long)context.Variables["prompt-tokens"];
                return usage.ToString();
            }" duration="3600" />
        </otherwise>
    </choose>
    <!-- Step 4: Add usage tracking header -->
    <set-header name="X-Token-Usage" exists-action="override">
        <value>@(((JObject)context.Variables["current-usage"]).Value<long>("used").ToString())</value>
    </set-header>
    <base />
</inbound>
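The characters-divided-by-four heuristic used in Step 1 of the policy can be tried outside APIM with a small helper. This is a sketch only: `TokenEstimator` is a hypothetical name, and the estimate is a rough approximation rather than a real tokenizer count.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class TokenEstimator
{
    // Joins the message contents with spaces and assumes about 4 characters per token.
    public static long EstimatePromptTokens(IEnumerable<string> messageContents)
    {
        var joined = string.Join(" ", messageContents.Select(c => c ?? ""));
        return joined.Length / 4;
    }
}
```

Because the division truncates, very short prompts are undercounted; a limit enforced on this estimate should be treated as approximate and reconciled against the `usage` field that Azure OpenAI returns.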
C# Implementation for Token Tracking
public class TokenRateLimitingService
{
private readonly ICacheClient _cache;
private readonly ILogger<TokenRateLimitingService> _logger;
public TokenRateLimitingService(
ICacheClient cache,
ILogger<TokenRateLimitingService> logger)
{
_cache = cache;
_logger = logger;
}
public async Task<RateLimitResult> CheckAndUpdateUsageAsync(
string userId,
long tokenCount,
long maxTokensPerHour = 100000)
{
var cacheKey = $"token_usage_{userId}";
// Get current usage; fall back to a fresh window if nothing usable is cached
var usageJson = await _cache.GetStringAsync(cacheKey);
var usage = (string.IsNullOrEmpty(usageJson)
? null
: JsonSerializer.Deserialize<TokenUsage>(usageJson))
?? new TokenUsage { ResetAt = DateTime.UtcNow.AddHours(1) };
// Check if window expired
if (DateTime.UtcNow >= usage.ResetAt)
{
usage = new TokenUsage
{
Used = 0,
ResetAt = DateTime.UtcNow.AddHours(1)
};
}
// Check limit
if (usage.Used + tokenCount > maxTokensPerHour)
{
var remaining = maxTokensPerHour - usage.Used;
_logger.LogWarning(
"User {UserId} exceeded token limit. Used: {Used}, Requested: {Requested}, Limit: {Limit}",
userId, usage.Used, tokenCount, maxTokensPerHour);
return new RateLimitResult
{
Allowed = false,
RemainingTokens = remaining,
ResetAt = usage.ResetAt,
RetryAfter = (usage.ResetAt - DateTime.UtcNow).TotalSeconds
};
}
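// Update usage. This read-modify-write is not atomic; use an atomic
// increment in the cache if concurrent requests must be counted exactly.
usage.Used += tokenCount;
// Expire the entry when the window ends, not a full hour after each write
await _cache.SetStringAsync(cacheKey, JsonSerializer.Serialize(usage),
usage.ResetAt - DateTime.UtcNow);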
_logger.LogInformation(
"User {UserId} used {Tokens} tokens. Total: {Used}, Limit: {Limit}",
userId, tokenCount, usage.Used, maxTokensPerHour);
return new RateLimitResult
{
Allowed = true,
RemainingTokens = maxTokensPerHour - usage.Used,
ResetAt = usage.ResetAt
};
}
}
public class TokenUsage
{
public long Used { get; set; }
public DateTime ResetAt { get; set; }
}
public class RateLimitResult
{
public bool Allowed { get; set; }
public long RemainingTokens { get; set; }
public DateTime ResetAt { get; set; }
public double RetryAfter { get; set; }
}
Step 3: Prompt Caching Strategy
Why Cache Prompts?
Cost Comparison - Caching:
Without Cache:
┌────────────────────────────────────────────────────────────┐
│ 1,000 users request "What is Azure?" at 9 AM │
│ │
│ 1,000 × 10 tokens × $0.001 = $10.00 for 1,000 identical requests │
│ │
│ Wasteful! Everyone gets the same answer │
└────────────────────────────────────────────────────────────┘
With Cache:
┌────────────────────────────────────────────────────────────┐
│ First request: Compute answer (10 tokens) → $0.01 │
│ Next 999 requests: Return cached answer (0 tokens) → $0 │
│ │
│ Total: $0.01 instead of $10.00 - 99.9% savings! │
└────────────────────────────────────────────────────────────┘
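The arithmetic in the comparison above can be reproduced with a short C# sketch. `CacheSavings` is a hypothetical helper, and it assumes the best case in which every request after the first is served from the cache.

```csharp
public static class CacheSavings
{
    // Every request is sent to the model.
    public static decimal CostWithoutCache(int requests, int tokensPerRequest, decimal pricePerToken)
        => requests * tokensPerRequest * pricePerToken;

    // Only the first request reaches the model; the rest are cache hits.
    public static decimal CostWithCache(int requests, int tokensPerRequest, decimal pricePerToken)
        => requests > 0 ? tokensPerRequest * pricePerToken : 0m;

    public static decimal SavingsPercent(int requests, int tokensPerRequest, decimal pricePerToken)
    {
        var without = CostWithoutCache(requests, tokensPerRequest, pricePerToken);
        if (without == 0m) return 0m;
        return (without - CostWithCache(requests, tokensPerRequest, pricePerToken)) / without * 100m;
    }
}
```

With 1,000 requests of 10 tokens at $0.001 per token, this gives $10.00 without the cache, $0.01 with it, and 99.9% savings. Real hit rates are lower, since prompts are rarely byte-for-byte identical.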
Caching Policy
<inbound>
    <!-- Generate cache key based on prompt content -->
    <set-variable name="cache-key" value="@{
        // preserveContent keeps the body available for the backend call
        var body = context.Request.Body?.As<JObject>(preserveContent: true);
        var messages = body?["messages"] as JArray;
        // No key (and so no caching) for unparseable or streaming requests
        if (messages == null || (bool?)body["stream"] == true) return "";
        // Deterministic key: the URL path (which names the deployment) plus each message
        var cacheContent = context.Request.Url.Path + "|" + string.Join("|", messages.Select(m =>
            m["role"] + ":" + m["content"]));
        // Hash the content for a fixed-length key
        using (var sha = SHA256.Create())
        {
            var hash = sha.ComputeHash(Encoding.UTF8.GetBytes(cacheContent));
            return Convert.ToBase64String(hash).Substring(0, 32);
        }
    }" />
    <!-- Try to get cached response, but only when a key was generated -->
    <choose>
        <when condition="@((string)context.Variables["cache-key"] != "")">
            <cache-lookup-value key="@((string)context.Variables["cache-key"])" variable-name="cached-response" />
        </when>
    </choose>
    <choose>
        <when condition="@(context.Variables.ContainsKey("cached-response"))">
            <!-- Return cached response -->
            <return-response>
                <set-status code="200" reason="OK" />
                <set-header name="Content-Type" exists-action="override">
                    <value>application/json</value>
                </set-header>
                <set-header name="X-Cache" exists-action="override">
                    <value>HIT</value>
                </set-header>
                <set-body>@((string)context.Variables["cached-response"])</set-body>
            </return-response>
        </when>
    </choose>
    <base />
</inbound>
<outbound>
    <base />
    <!-- Cache successful responses -->
    <choose>
        <when condition="@(context.Response.StatusCode == 200 && (string)context.Variables["cache-key"] != "")">
            <set-variable name="response-body" value="@(context.Response.Body.As<JObject>(preserveContent: true))" />
            <!-- Only cache if response contains actual content -->
            <choose>
                <when condition="@((context.Variables.GetValueOrDefault<JObject>("response-body")?["choices"] as JArray)?.Count > 0)">
                    <!-- Stored as a string so the lookup above can return it directly -->
                    <cache-store-value key="@((string)context.Variables["cache-key"])"
                        value="@(((JObject)context.Variables["response-body"]).ToString())"
                        duration="3600" />
                    <set-header name="X-Cache" exists-action="override">
                        <value>MISS</value>
                    </set-header>
                </when>
            </choose>
        </when>
    </choose>
</outbound>
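The key derivation from the policy can also be sketched as standalone C#, which makes it easier to unit test. `PromptCacheKey` is a hypothetical helper, it assumes .NET 5 or later for `Convert.ToHexString`, and it adds the model name to the key as an assumption so that different models never share cache entries.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class PromptCacheKey
{
    // Builds a deterministic SHA-256 key from the model name and the ordered messages.
    public static string Compute(string model, IEnumerable<(string Role, string Content)> messages)
    {
        var canonical = model + "\n" + string.Join("|", messages.Select(m => m.Role + ":" + m.Content));
        using var sha = SHA256.Create();
        var hash = sha.ComputeHash(Encoding.UTF8.GetBytes(canonical));
        return Convert.ToHexString(hash); // 64 hexadecimal characters
    }
}
```

Identical inputs always produce the same key, while any change to the model, a role, or message text produces a different one. Parameters that change the output, such as `temperature`, could be appended to `canonical` in the same way.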
Smart Caching - Identify Cacheable Requests
public class CacheableRequestDetector
{
// Determine if a request should be cached
public bool ShouldCache(ChatCompletionRequest request)
{
// Don't cache if:
// 1. Temperature is high: sampling is more random, so responses vary between calls
if (request.Temperature > 0.5m)
return false;
// 2. Has system messages with unresolved template placeholders
if (request.Messages.Any(m => m.Role == "system" &&
m.Content?.Contains("{{") == true))
return false;
// 3. Is a streaming request (can't cache partial responses)
if (request.Stream == true)
return false;
// 4. Likely contains user-specific context (crude keyword heuristic)
if (request.Messages.Any(m => m.Content != null &&
(m.Content.Contains("my ", StringComparison.OrdinalIgnoreCase) ||
m.Content.Contains("I am ", StringComparison.OrdinalIgnoreCase))))
return false;
// Cache if: static prompts, documentation lookups, FAQs
return true;
}
// Determine cache duration based on content type
public TimeSpan GetCacheDuration(ChatCompletionRequest request)
{
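// Classify on the first user message; the very first message is often the system prompt
var firstMessage = request.Messages.FirstOrDefault(m => m.Role == "user")?.Content?.ToLowerInvariant() ?? "";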
if (firstMessage.Contains("documentation") ||
firstMessage.Contains("help") ||
firstMessage.Contains("faq"))
{
// Documentation can be cached longer
return TimeSpan.FromHours(24);
}
if (firstMessage.Contains("what is") ||
firstMessage.Contains("define"))
{
// General knowledge questions - medium cache
return TimeSpan.FromHours(4);
}
// Default: shorter cache
return TimeSpan.FromHours(1);
}
}
Step 4: Regional Load Balancing
Multi-Region Setup
<inbound>
    <!-- Select backend based on user's region -->
    <set-variable name="target-region" value="@{
        var region = context.Request.Headers.GetValueOrDefault("X-User-Region", "auto");
        // If explicitly specified, use that region
        if (region != "auto") return region;
        // Otherwise use a region header set by an upstream proxy, if present
        var forwardedRegion = context.Request.Headers.GetValueOrDefault("X-Forwarded-Region", "");
        if (!string.IsNullOrEmpty(forwardedRegion)) return forwardedRegion;
        // Default to primary region
        return "eastus";
    }" />
    <!-- Route to the matching backend; unknown values fall back to the primary -->
    <set-backend-service backend-id="@{
        // A switch statement is used because policy expressions may not support switch expressions
        switch (context.Variables.GetValueOrDefault<string>("target-region"))
        {
            case "westeurope": return "backend-westeurope";
            case "southeastasia": return "backend-southeastasia";
            default: return "backend-eastus";
        }
    }" />
    <base />
</inbound>
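Because `X-User-Region` is supplied by the client, the routing logic has to treat it as untrusted and map anything unexpected to the primary backend. A minimal C# sketch of that allow-list lookup follows; `RegionRouter` is a hypothetical helper, and the backend IDs mirror the ones used in the policy.

```csharp
using System;
using System.Collections.Generic;

public static class RegionRouter
{
    private const string DefaultBackend = "backend-eastus";

    // Allow-list of region names mapped to APIM backend IDs.
    private static readonly Dictionary<string, string> Backends =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["eastus"] = "backend-eastus",
            ["westeurope"] = "backend-westeurope",
            ["southeastasia"] = "backend-southeastasia"
        };

    // Returns the backend for a requested region, or the primary when the
    // region is missing or not on the allow-list.
    public static string Resolve(string requestedRegion)
        => requestedRegion != null && Backends.TryGetValue(requestedRegion, out var id)
            ? id
            : DefaultBackend;
}
```

Keeping the mapping in one table means a new region is added in one place, and a misspelled or malicious header value can never select an arbitrary backend.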
Health-Based Routing
public class OpenAIBackendManager
{
private readonly List<OpenAIBackend> _backends;
private readonly ILogger<OpenAIBackendManager> _logger;
public OpenAIBackendManager(IConfiguration configuration, ILogger<OpenAIBackendManager> logger)
{
_logger = logger;
// Configure your available backends
_backends = new List<OpenAIBackend>
{
new() {
Name = "eastus-primary",
Endpoint = "https://eastus-openai.openai.azure.com",
Region = "East US",
IsPrimary = true
},
new() {
Name = "westeurope-failover",
Endpoint = "https://westeurope-openai.openai.azure.com",
Region = "West Europe",
IsPrimary = false
},
new() {
Name = "southeastasia-failover",
Endpoint = "https://southeastasia-openai.openai.azure.com",
Region = "Southeast Asia",
IsPrimary = false
}
};
}
public async Task<BackendSelectionResult> SelectBackendAsync()
{
// Try primary first
var primary = _backends.FirstOrDefault(b => b.IsPrimary);
if (await IsHealthyAsync(primary))
{
_logger.LogInformation("Using primary backend: {Name}", primary.Name);
return new BackendSelectionResult { Backend = primary, Reason = "Primary healthy" };
}
// Try fallbacks
foreach (var fallback in _backends.Where(b => !b.IsPrimary))
{
if (await IsHealthyAsync(fallback))
{
_logger.LogWarning("Primary failed, using fallback: {Name}", fallback.Name);
return new BackendSelectionResult { Backend = fallback, Reason = "Primary unhealthy" };
}
}
// All backends unhealthy - return primary anyway (better than nothing)
_logger.LogError("All backends unhealthy, using primary");
return new BackendSelectionResult { Backend = primary, Reason = "All unhealthy" };
}
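// Shared client: creating an HttpClient per check can exhaust sockets under load
private static readonly HttpClient HealthClient = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };
private async Task<bool> IsHealthyAsync(OpenAIBackend backend)
{
if (backend == null) return false;
try
{
// Simple health check: call the models endpoint.
// The request must carry credentials (api-key or bearer token); without them
// Azure OpenAI answers 401 and the backend is reported unhealthy.
var uri = new Uri(new Uri(backend.Endpoint), "/openai/models?api-version=2023-05-15");
var response = await HealthClient.GetAsync(uri);
return response.IsSuccessStatusCode;
}
catch (Exception ex)
{
_logger.LogWarning(ex, "Health check failed for {Name}", backend.Name);
return false;
}
}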
}
public class OpenAIBackend
{
public string Name { get; set; }
public string Endpoint { get; set; }
public string Region { get; set; }
public bool IsPrimary { get; set; }
}
public class BackendSelectionResult
{
public OpenAIBackend Backend { get; set; }
public string Reason { get; set; }
}
Step 5: Request/Response Transformation
Standardize API Surface
<inbound>
    <base />
    <!-- Transform request to OpenAI format -->
    <set-body>@{
        var request = context.Request.Body.As<JObject>();
        // Our API uses 'prompt', convert to OpenAI 'messages' format
        if (request["prompt"] != null && request["messages"] == null)
        {
            var prompt = request["prompt"].ToString();
            request["messages"] = new JArray
            {
                new JObject
                {
                    ["role"] = "user",
                    ["content"] = prompt
                }
            };
            request.Remove("prompt");
        }
        // Add default parameters if not specified
        if (request["temperature"] == null)
            request["temperature"] = 0.7;
        if (request["max_tokens"] == null)
            request["max_tokens"] = 1000;
        return request.ToString();
    }</set-body>
</inbound>
<outbound>
    <base />
    <!-- Transform response to our format -->
    <set-body>@{
        var response = context.Response.Body.As<JObject>();
        // Our API returns a simpler format; missing fields come back as null
        var firstChoice = (response["choices"] as JArray)?.FirstOrDefault();
        var result = new JObject
        {
            ["id"] = response["id"],
            ["created"] = response["created"],
            ["answer"] = firstChoice?["message"]?["content"],
            ["usage"] = response["usage"],
            ["model"] = response["model"]
        };
        return result.ToString();
    }</set-body>
</outbound>
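The request-side mapping can be prototyped and tested outside the gateway before it goes into a policy. The sketch below uses `System.Text.Json.Nodes` (assuming .NET 6 or later) instead of the Json.NET types available in policy expressions; `PromptAdapter` is a hypothetical name.

```csharp
using System;
using System.Text.Json.Nodes;

public static class PromptAdapter
{
    // Converts {"prompt": "..."} into the chat 'messages' shape and fills defaults.
    public static JsonObject ToChatRequest(string json)
    {
        var request = JsonNode.Parse(json) as JsonObject
            ?? throw new ArgumentException("Expected a JSON object", nameof(json));

        if (request["prompt"] is JsonNode promptNode && request["messages"] is null)
        {
            var prompt = promptNode.GetValue<string>();
            request.Remove("prompt");
            request["messages"] = new JsonArray(
                new JsonObject { ["role"] = "user", ["content"] = prompt });
        }

        // Same defaults as the policy above
        if (request["temperature"] is null) request["temperature"] = 0.7;
        if (request["max_tokens"] is null) request["max_tokens"] = 1000;
        return request;
    }
}
```

A request that already contains `messages` passes through untouched apart from the defaults, so callers on either format get the same backend behavior.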
Step 6: Cost Tracking and Analytics
Usage Tracking Implementation
public class OpenAIUsageTracker
{
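private readonly TableClient _tableClient; // from Azure.Data.Tables
private readonly ILogger<OpenAIUsageTracker> _logger;
public OpenAIUsageTracker(TableClient tableClient, ILogger<OpenAIUsageTracker> logger)
{
_tableClient = tableClient;
_logger = logger;
}
public async Task RecordUsageAsync(UsageRecord record)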
{
try
{
// Store in Azure Table
await _tableClient.AddEntityAsync(new TableEntity
{
PartitionKey = record.Date.ToString("yyyy-MM-dd"),
RowKey = Guid.NewGuid().ToString(),
["UserId"] = record.UserId,
["PromptTokens"] = record.PromptTokens,
["CompletionTokens"] = record.CompletionTokens,
["TotalTokens"] = record.TotalTokens,
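["Cost"] = (double)record.Cost, // Table storage has no decimal type
["Model"] = record.Model,
["Endpoint"] = record.Endpoint,
// "Timestamp" is a system property set by the service, so use a separate column
["RecordedAt"] = DateTimeOffset.UtcNow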
});
_logger.LogDebug("Recorded usage for user {UserId}: {Tokens} tokens (${Cost})",
record.UserId, record.TotalTokens, record.Cost);
}
catch (Exception ex)
{
_logger.LogError(ex, "Failed to record usage");
}
}
public async Task<CostSummary> GetCostSummaryAsync(
string userId,
DateTime from,
DateTime to)
{
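// Query and aggregate usage. CreateQueryFilter quotes and escapes interpolated
// values, which prevents OData filter injection through userId.
var fromKey = from.ToString("yyyy-MM-dd");
var toKey = to.ToString("yyyy-MM-dd");
var query = TableClient.CreateQueryFilter(
$"PartitionKey ge {fromKey} and PartitionKey le {toKey}");
// If user specified, filter
if (!string.IsNullOrEmpty(userId))
query += " and " + TableClient.CreateQueryFilter($"UserId eq {userId}");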
var results = new List<UsageRecord>();
await foreach (var entity in _tableClient.QueryAsync<TableEntity>(query))
{
results.Add(new UsageRecord
{
UserId = entity.GetString("UserId"),
PromptTokens = entity.GetInt32("PromptTokens") ?? 0,
CompletionTokens = entity.GetInt32("CompletionTokens") ?? 0,
// TotalTokens is computed from the two counts above and cannot be assigned
Cost = (decimal)(entity.GetDouble("Cost") ?? 0d)
});
}
return new CostSummary
{
TotalCost = results.Sum(r => r.Cost),
TotalTokens = results.Sum(r => r.TotalTokens),
RequestCount = results.Count,
AverageCostPerRequest = results.Any()
? results.Average(r => r.Cost)
: 0
};
}
}
// Pricing (example rates - check Azure pricing)
public static class OpenAIPricing
{
// Prices per 1K tokens (example)
public const decimal GPT4_8K_Input = 0.03m;
public const decimal GPT4_8K_Output = 0.06m;
public const decimal GPT35_4K_Input = 0.001m;
public const decimal GPT35_4K_Output = 0.002m;
public static decimal CalculateCost(
string model,
int promptTokens,
int completionTokens)
{
// Determine pricing tier
var (inputPrice, outputPrice) = model.ToLowerInvariant() switch
{
var m when m.Contains("gpt-4") => (GPT4_8K_Input, GPT4_8K_Output),
_ => (GPT35_4K_Input, GPT35_4K_Output)
};
var inputCost = (promptTokens / 1000m) * inputPrice;
var outputCost = (completionTokens / 1000m) * outputPrice;
return inputCost + outputCost;
}
}
public class UsageRecord
{
public string UserId { get; set; }
public DateTime Date { get; set; }
public int PromptTokens { get; set; }
public int CompletionTokens { get; set; }
public int TotalTokens => PromptTokens + CompletionTokens;
public decimal Cost { get; set; }
public string Model { get; set; }
public string Endpoint { get; set; }
}
public class CostSummary
{
public decimal TotalCost { get; set; }
public int TotalTokens { get; set; }
public int RequestCount { get; set; }
public decimal AverageCostPerRequest { get; set; }
}
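As a worked example of the per-1K-token formula, the sketch below recomputes one request by hand. `CostExample` is a hypothetical helper, and the rates passed in are the example GPT-4 prices from the pricing table above, not current Azure pricing.

```csharp
public static class CostExample
{
    // cost = (prompt / 1000) * input rate + (completion / 1000) * output rate
    public static decimal Cost(int promptTokens, int completionTokens,
                               decimal inputPer1K, decimal outputPer1K)
        => promptTokens / 1000m * inputPer1K + completionTokens / 1000m * outputPer1K;
}
```

For 1,500 prompt tokens and 500 completion tokens at $0.03 and $0.06 per 1K tokens, the cost is 1.5 × 0.03 + 0.5 × 0.06 = $0.075. Dividing by `1000m` keeps the arithmetic in `decimal`; an integer division would round small requests down to zero.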
Step 7: Complete Policy Example
<policies>
    <inbound>
        <!-- 1. Authentication; the validated token is kept in the "jwt" variable -->
        <validate-jwt header-name="Authorization" failed-validation-error-message="Unauthorized" output-token-variable-name="jwt">
            <openid-config url="https://login.microsoftonline.com/{tenant}/v2.0/.well-known/openid-configuration" />
            <audiences>
                <audience>api://your-app-id</audience>
            </audiences>
        </validate-jwt>
        <!-- 2. Extract and validate prompt -->
        <set-variable name="prompt-tokens" value="@{
            var body = context.Request.Body?.As<JObject>(preserveContent: true);
            var messages = body?["messages"] as JArray;
            if (messages == null) return 0;
            var content = string.Join(" ", messages.Select(m => m["content"]?.ToString() ?? ""));
            return content.Length / 4; // Rough estimate
        }" />
        <!-- 3. Check rate limit -->
        <set-variable name="rate-limit-check" value="@{
            var userId = ((Jwt)context.Variables["jwt"]).Claims.GetValueOrDefault("oid", "anonymous");
            // Call your rate limiting service
            return true; // Simplified: always allows the request
        }" />
        <choose>
            <when condition="@(!context.Variables.GetValueOrDefault<bool>("rate-limit-check"))">
                <return-response>
                    <set-status code="429" reason="Too Many Requests" />
                    <set-body>{"error": "Token rate limit exceeded"}</set-body>
                </return-response>
            </when>
        </choose>
        <!-- 4. Try cache lookup -->
        <set-variable name="cache-key" value="@{
            // Simplified: hash the path and raw body; Step 3 builds the key from the messages
            var raw = context.Request.Body?.As<string>(preserveContent: true) ?? "";
            using (var sha = SHA256.Create())
            {
                return Convert.ToBase64String(
                    sha.ComputeHash(Encoding.UTF8.GetBytes(context.Request.Url.Path + raw)));
            }
        }" />
        <cache-lookup-value key="@((string)context.Variables["cache-key"])" variable-name="cached-response" />
        <choose>
            <when condition="@(context.Variables.ContainsKey("cached-response"))">
                <return-response>
                    <set-status code="200" reason="OK" />
                    <set-header name="X-Cache" exists-action="override"><value>HIT</value></set-header>
                    <set-body>@((string)context.Variables["cached-response"])</set-body>
                </return-response>
            </when>
        </choose>
        <!-- 5. Route to backend -->
        <set-backend-service backend-id="backend-openai" />
        <base />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <!-- 6. Cache successful responses (stored as a string to match the lookup) -->
        <choose>
            <when condition="@(context.Response.StatusCode == 200)">
                <cache-store-value key="@((string)context.Variables["cache-key"])"
                    value="@(context.Response.Body.As<string>(preserveContent: true))"
                    duration="3600" />
                <set-header name="X-Cache" exists-action="override"><value>MISS</value></set-header>
            </when>
        </choose>
        <!-- 7. Add usage tracking (placeholder values; populate from the rate limiter) -->
        <set-header name="X-Rate-Limit-Limit" exists-action="override"><value>100000</value></set-header>
        <set-header name="X-Rate-Limit-Remaining" exists-action="override"><value>90000</value></set-header>
        <base />
    </outbound>
</policies>
Best Practices Summary
| Practice | Why | Implementation |
|---|---|---|
| Token-based rate limiting | Fair cost distribution | Track token usage, not just requests |
| Prompt caching | Reduce costs | Cache static/deterministic prompts |
| Regional routing | Low latency + redundancy | Route to nearest healthy region |
| Managed Identity | Better security | No API keys to manage |
| Usage tracking | Cost visibility | Log all requests with tokens |
| Response transformation | API consistency | Standardize across backends |
Monitoring Dashboard
// Azure Monitor (Application Insights) queries for OpenAI usage.
// They assume PromptTokens, Cost, and CacheHit are emitted as custom dimensions.
// Prompt tokens by hour
requests
| where url contains "openai" and operation_Name startswith "POST"
| extend promptTokens = tolong(customDimensions.PromptTokens)
| summarize sum(promptTokens) by bin(timestamp, 1h)
// Cost by user
requests
| where url contains "openai"
| extend cost = todouble(customDimensions.Cost)
| summarize sum(cost) by user_Id
// Cache hit rate
requests
| where url contains "openai"
| extend cacheStatus = tostring(customDimensions.CacheHit)
| summarize count() by cacheStatus
| render piechart
Conclusion
Using APIM as an AI gateway provides:
- Cost Control - Token-based rate limiting prevents runaway costs
- Performance - Caching reduces latency and API calls
- Reliability - Multi-region routing with health checks
- Security - Centralized authentication and validation
- Observability - Complete usage tracking and analytics
Key takeaways:
- Implement token-based limits, not request-based
- Cache deterministic prompts for huge savings
- Use regional routing for global applications
- Track everything - you can't optimize what you don't measure