AI Gateway Enterprise: This plugin is only available as part of our AI Gateway Enterprise offering.
The AI Proxy Advanced plugin lets you transform and proxy requests to multiple AI providers and models at the same time. This lets you set up load balancing between targets.
The AI Proxy Advanced plugin accepts requests in one of a few defined and standardized OpenAI formats, translates them to the configured target format, and then transforms the response back into a standard format.
v3.10+ To use AI Proxy Advanced with a non-OpenAI format without conversion, see Supported native LLM formats.
Overview of capabilities
The AI Proxy Advanced plugin supports capabilities across batch processing, multimodal embeddings, agents, audio, image, streaming, and more, spanning multiple providers.
For Kong Gateway versions 3.6 or earlier:
- Chat Completions APIs: Multi-turn conversations with system/user/assistant roles.
- Completions API: Generates free-form text from a prompt.
  OpenAI has marked this endpoint as legacy and recommends using the Chat Completions API for developing new applications.
See the following table for capabilities supported in AI Gateway:
| API capability | Description | Examples | OpenAI format |
|---|---|---|---|
| Chat completions | Generates conversational responses from a sequence of messages using supported LLM providers. | | Supported |
| Embeddings | Converts text to vector representations for semantic search and similarity matching. | | Supported |
| Function calling | Allows models to invoke external tools and APIs based on conversation context. | | Supported |
| Assistants and responses | Powers persistent tool-using agents and exposes metadata for debugging and evaluation. | | Supported |
| Batches and files | Supports asynchronous bulk LLM requests and file uploads for long documents and structured input. | | Supported |
| Audio | Enables speech-to-text, text-to-speech, and translation for voice applications. | | Supported |
| Image generation and editing | Generates or modifies images from text prompts. | | Supported |
| Video generation | Generates videos from text prompts. | | Supported |
| Realtime | Bidirectional WebSocket streaming for low-latency, interactive voice and text applications. | | Supported |
| AWS Bedrock native APIs | Enables advanced orchestration and real-time RAG via Converse and RetrieveAndGenerate endpoints. Available only when using native LLM format for Bedrock. | | Not supported |
| Hugging Face native APIs | Provides text generation and streaming using Hugging Face models. Available only when using native LLM format for Hugging Face. | | Not supported |
| Rerank | Reorders documents by relevance for RAG pipelines using Bedrock or Cohere rerank APIs. Available only when using native LLM format for Bedrock and Cohere. | | Not supported |
The following providers are supported by the legacy Completions API:
- OpenAI
- Azure OpenAI
- Cohere
- Llama2
- Amazon Bedrock
- Gemini
- Hugging Face
Supported AI providers
AI Gateway supports proxying requests to multiple AI providers. For detailed capability support, configuration requirements, and provider-specific limitations, see the individual provider reference pages.
How it works
The AI Proxy Advanced plugin mediates the following for you:
- Request and response formats appropriate for the configured config.targets[].model.provider and config.targets[].route_type
- The following service request coordinates (unless the model is self-hosted):
  - Protocol
  - Host name
  - Port
  - Path
  - HTTP method
- Authentication on behalf of the Kong API consumer
- Decorating the request with parameters from the config.targets[].model.options block, appropriate for the chosen provider
- Recording of usage statistics of the configured LLM provider and model into your selected Kong log plugin output
- Optionally, recording all post-transformation request and response messages from users, to and from the configured LLM
- Fulfillment of requests to self-hosted models, based on select supported format transformations
Flattening all of the provider formats allows you to standardize the manipulation of the data before and after transmission. It also allows you to provide a choice of LLMs to the Kong Gateway Consumers, using consistent request and response formats, regardless of the backend provider or model.
v3.11+ AI Proxy Advanced supports REST-based full-text responses, including RESTful endpoints such as llm/v1/responses, llm/v1/files, llm/v1/assistants, and llm/v1/batches. RESTful endpoints support CRUD operations: you can POST to create a response, GET to retrieve it, or DELETE to remove it.
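As a minimal sketch of the settings the plugin mediates, the following configuration defines a single OpenAI chat target. The model name, option values, and the $OPENAI_API_KEY placeholder are illustrative assumptions, and the auth field names (header_name, header_value) are not taken from this page, so confirm them against the plugin's configuration reference.

```yaml
config:
  targets:
  - route_type: llm/v1/chat            # selects the request and response format
    model:
      provider: openai                 # determines protocol, host, port, and path upstream
      name: gpt-4o                     # assumed model name, for illustration only
      options:                         # parameters the plugin adds to each upstream request
        max_tokens: 512
        temperature: 1.0
    auth:                              # credentials sent on behalf of the Kong consumer
      header_name: Authorization       # assumed field name
      header_value: Bearer $OPENAI_API_KEY
```

Clients then send a standard OpenAI chat payload to the Route, and the plugin fills in the upstream coordinates and credentials.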
Request and response formats
AI Gateway transforms requests and responses according to the configured config.targets[].model.provider and config.targets[].route_type, using the OpenAI format by default. v3.10+ To use a provider’s native format instead, set config.llm_format to a value other than openai. The plugin then passes requests upstream without transformation. See Supported native LLM formats for available options.
The following table maps each route type to its OpenAI API reference and generative AI category. See the AI provider reference pages for provider-specific details.
| Route type | OpenAI API reference | Gen AI category | Min version |
|---|---|---|---|
| llm/v1/chat | Chat completions | text/generation | 3.6 |
| llm/v1/completions | Completions | text/generation | 3.6 |
| llm/v1/embeddings | Embeddings | text/embeddings | 3.11 |
| llm/v1/files | Files | N/A | 3.11 |
| llm/v1/batches | Batch | N/A | 3.11 |
| llm/v1/assistants | Assistants | text/generation | 3.11 |
| llm/v1/responses | Responses | text/generation | 3.11 |
| realtime/v1/realtime | Realtime | realtime/generation | 3.11 |
| audio/v1/audio/speech | Create speech | audio/speech | 3.11 |
| audio/v1/audio/transcriptions | Create transcription | audio/transcription | 3.11 |
| audio/v1/audio/translations | Create translation | audio/transcription | 3.11 |
| image/v1/images/generations | Create image | image/generation | 3.11 |
| image/v1/images/edits | Create image edit | image/generation | 3.11 |
| video/v1/videos/generations | Create video | video/generation | 3.13 |
Provider-specific parameters can be passed using the extra_body field in your request. See the sample OpenAPI specification for detailed format examples.
Supported native LLM formats v3.10+
If you use a provider’s native SDK, AI Gateway v3.10+ can proxy the request and return the upstream response without payload format conversion. Set config.llm_format to a value other than openai to preserve the provider’s native request and response formats.
In this mode, AI Gateway will still provide analytics, logging, and cost calculation.
When config.llm_format is set to a native format, only the corresponding provider is supported with its specific APIs.
| Provider | LLM format | Native capabilities |
|---|---|---|
| Anthropic | anthropic | Messages, batch processing |
| Amazon Bedrock | bedrock | Converse, RAG (RetrieveAndGenerate), reranking, async invocation |
| Cohere | cohere | Reranking |
| Gemini | gemini | Content generation, embeddings, batches, file uploads |
| Vertex AI | gemini | Content generation, embeddings, batches, reranking, long-running predictions |
| Hugging Face | huggingface | Text generation, streaming |
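For illustration, a native-format setup might look like the sketch below, which sets config.llm_format to anthropic so that Anthropic SDK payloads pass through unconverted. The $ANTHROPIC_MODEL and $ANTHROPIC_API_KEY placeholders, the auth field names, and the anthropic_version option are assumptions rather than values taken from this page.

```yaml
config:
  llm_format: anthropic                # any value other than openai disables format conversion
  targets:
  - route_type: llm/v1/chat
    model:
      provider: anthropic              # only the provider matching llm_format is supported
      name: $ANTHROPIC_MODEL           # placeholder for the model you want to call
      options:
        anthropic_version: "2023-06-01"   # assumed option and value; check the provider page
    auth:
      header_name: x-api-key           # assumed field name and header
      header_value: $ANTHROPIC_API_KEY
```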
Load balancing
AI Proxy Advanced supports several load balancing algorithms for distributing requests across AI models:
- Round-robin: Weighted traffic distribution.
- Consistent-hashing: Sticky sessions based on header values.
- Least-connections: Route to backends with spare capacity.
- Lowest-latency: Route to fastest-responding models.
- Lowest-usage: Route based on token counts or cost.
- Semantic: Route based on prompt-to-model similarity.
- Priority: Tiered failover across model groups.
For detailed algorithm descriptions and selection guidance, see Load balancing algorithms.
For load balancing across Gateway Upstreams and Targets instead of LLMs, see load balancing with Kong Gateway.
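As a rough sketch of weighted round-robin, the configuration below splits traffic roughly 80/20 between two targets. The balancer.algorithm and weight field names and the model placeholders are assumptions for illustration, and authentication is omitted for brevity.

```yaml
config:
  balancer:
    algorithm: round-robin             # assumed field name for selecting the algorithm
  targets:
  - route_type: llm/v1/chat
    weight: 80                         # receives about 80% of requests
    model:
      provider: openai
      name: $PRIMARY_MODEL             # placeholder
  - route_type: llm/v1/chat
    weight: 20                         # receives about 20% of requests
    model:
      provider: openai
      name: $SECONDARY_MODEL           # placeholder
```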
Retry and fallback
The AI load balancer supports configurable retries, timeouts, and failover to different models when a target is unavailable.
v3.10+ Fallback works across targets with any supported format. You can mix providers freely, for example OpenAI and Mistral. Earlier versions require compatible formats between fallback targets. For configuration details, see Retry and fallback configuration.
Client errors don’t trigger failover. To fail over on additional error types, set config.balancer.failover_criteria to include HTTP codes like http_429 or http_502, and non_idempotent for POST requests.
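A failover policy along these lines could be sketched as follows. The http_429, http_502, and non_idempotent values come from the text above; the error and timeout values and the retries field are assumptions, so verify them in the configuration reference before use.

```yaml
config:
  balancer:
    algorithm: round-robin
    retries: 3                         # assumed field: attempts before giving up
    failover_criteria:
    - error                            # assumed value: connection-level failures
    - timeout                          # assumed value: upstream timeouts
    - http_429                         # fail over when a provider rate-limits the request
    - http_502                         # fail over on bad gateway responses
    - non_idempotent                   # allow failover for POST requests
```

Because most LLM calls are POST requests, leaving out non_idempotent would likely prevent failover for them.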
Health check and circuit breaker v3.13+
The AI load balancer supports circuit breakers to improve reliability. If a target reaches the failure threshold defined by config.balancer.max_fails, the load balancer stops routing requests to it until the timeout period (config.balancer.fail_timeout) elapses.
For configuration details and behavior examples, see Circuit breaker.
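A circuit breaker setup might be sketched like this. The two field names come from the description above, while the values and the unit of fail_timeout are assumptions that should be checked against the Circuit breaker reference.

```yaml
config:
  balancer:
    max_fails: 3                       # failures that trip the breaker for a target
    fail_timeout: 10000                # assumed to be milliseconds; the target is skipped until this elapses
```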
Templating v3.7+
The plugin allows you to substitute values in config.targets[].model.name and any parameter under config.targets[].model.options with specific placeholders, similar to those in the Request Transformer Advanced plugin.
The following templated parameters are available:
- $(headers.header_name): The value of a specific request header.
- $(uri_captures.path_parameter_name): The value of a captured URI path parameter.
- $(query_params.query_parameter_name): The value of a query string parameter.
You can combine these parameters with an OpenAI-compatible SDK in multiple ways using the AI Proxy and AI Proxy Advanced plugins, depending on your specific use case:
| Action | Description |
|---|---|
| Select different models dynamically on one provider | Allow users to select the target model based on a request header or parameter. Supports flexible routing across different models on the same provider. |
| Use one chat route with dynamic Azure OpenAI deployments | Configure a dynamic route to target multiple Azure OpenAI model deployments. |
| Use multiple routes to map multiple Azure deployments | Use separate Routes to map Azure OpenAI SDK requests to specific deployments of GPT-3.5 and GPT-4. |
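The templated parameters can be sketched in a target definition like the one below. The header name x-model-name, the capture name deployment, and the Azure option names are hypothetical choices for this example, and the Route would need a path with a matching named capture group.

```yaml
config:
  targets:
  - route_type: llm/v1/chat
    model:
      provider: azure
      name: "$(headers.x-model-name)"                   # model taken from a request header (hypothetical name)
      options:
        azure_instance: $AZURE_INSTANCE                   # placeholder; assumed option name
        azure_deployment_id: "$(uri_captures.deployment)" # deployment taken from the URI path (hypothetical capture)
```

Quoting the templated values keeps the YAML parser from misreading the $( ) syntax.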
Vector databases
A vector database stores vector embeddings, which are numerical representations of data items. For example, a response is converted to a numerical representation and stored in the vector database so that the plugin can compare new requests against the stored vectors to find relevant cached items.
The AI Proxy Advanced plugin supports the following vector databases:
- Using config.vectordb.strategy: redis and parameters in config.vectordb.redis:
  - Redis with Vector Similarity Search (VSS)
  - Redis Cloud
  - Valkey v3.14+: When you configure vectordb.strategy: redis, Kong Gateway queries the server and checks the server name field. If it detects Valkey, it automatically uses the Valkey-specific driver.
  - Managed Redis with cloud authentication:
    - AWS ElastiCache (auth_provider: aws)
    - Azure Managed Redis (auth_provider: azure)
    - Google Cloud Memorystore (auth_provider: gcp)

    For configuration details, see Using cloud authentication with Redis.
- Using config.vectordb.strategy: pgvector and parameters in config.vectordb.pgvector:
  - PostgreSQL with pgvector v3.10+
To learn more about vector databases in AI Gateway, see Embedding-based similarity matching in Kong AI gateway plugins.
Partials v3.13+
This plugin supports all three AI Partial types, which let you define shared configuration once and reuse it across multiple AI Gateway plugins.
| Partial type | Fields covered |
|---|---|
| vectordb | config.vectordb |
| embeddings | config.embeddings |
| model | Each element of config.targets[] |
A model Partial applies to each entry in the config.targets array, so you can share one provider configuration across multiple targets.
For setup instructions, see AI plugin Partials.
Using cloud authentication with Redis v3.13+
If your plugin uses a Redis datastore, you can authenticate to it with a cloud Redis provider. This allows you to seamlessly rotate credentials without relying on static passwords.
The following providers are supported:
- AWS ElastiCache
- Azure Managed Redis
- Google Cloud Memorystore (with or without Valkey)
For an AWS ElastiCache instance, you need:
- A running Redis instance on an AWS ElastiCache instance for Valkey 7.2 or later or ElastiCache for Redis OSS version 7.0 or later
- An ElastiCache user with “Authentication mode” set to “IAM”
- The following policy assigned to the IAM user or IAM role that is used to connect to ElastiCache:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticache:Connect"
      ],
      "Resource": [
        "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE",
        "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE_USER"
      ]
    }
  ]
}
```

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

```yaml
config:
  vectordb:
    strategy: redis
    redis:
      host: $INSTANCE_ADDRESS
      username: $INSTANCE_USERNAME
      port: 6379
      cloud_authentication:
        auth_provider: aws
        aws_cache_name: $AWS_CACHE_NAME
        aws_is_serverless: false
        aws_region: $AWS_REGION
        aws_access_key_id: $AWS_ACCESS_KEY_ID
        aws_secret_access_key: $AWS_ACCESS_SECRET_KEY
```

Replace the following with your actual values:
- $INSTANCE_ADDRESS: The ElastiCache instance address.
- $INSTANCE_USERNAME: The ElastiCache username with IAM Auth mode configured.
- $AWS_CACHE_NAME: Name of your AWS ElastiCache instance.
- $AWS_REGION: Your AWS ElastiCache instance region.
- $AWS_ACCESS_KEY_ID: (Optional) Your AWS access key ID.
- $AWS_ACCESS_SECRET_KEY: (Optional) Your AWS secret access key.
For an AWS ElastiCache cluster, you need:
- A running Redis instance on an AWS ElastiCache cluster for Valkey 7.2 or later or ElastiCache for Redis OSS version 7.0 or later
- An ElastiCache user with “Authentication mode” set to “IAM”
- The following policy assigned to the IAM user or IAM role that is used to connect to ElastiCache:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticache:Connect"
      ],
      "Resource": [
        "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE",
        "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE_USER"
      ]
    }
  ]
}
```

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

```yaml
config:
  vectordb:
    strategy: redis
    redis:
      cluster_nodes:
      - ip: $CLUSTER_ADDRESS
        port: 6379
      username: $CLUSTER_USERNAME
      port: 6379
      cloud_authentication:
        auth_provider: aws
        aws_cache_name: $AWS_CACHE_NAME
        aws_is_serverless: false
        aws_region: $AWS_REGION
        aws_access_key_id: $AWS_ACCESS_KEY_ID
        aws_secret_access_key: $AWS_ACCESS_SECRET_KEY
```

Replace the following with your actual values:
- $CLUSTER_ADDRESS: The ElastiCache cluster address.
- $CLUSTER_USERNAME: The ElastiCache username with IAM Auth mode configured.
- $AWS_CACHE_NAME: Name of your AWS ElastiCache cluster.
- $AWS_REGION: Your AWS ElastiCache cluster region.
- $AWS_ACCESS_KEY_ID: (Optional) Your AWS access key ID.
- $AWS_ACCESS_SECRET_KEY: (Optional) Your AWS secret access key.
For an Azure Managed Redis instance, you need:
- A running Redis instance on an Azure Managed Redis instance with Entra authentication configured
- The user, service principal, or identity added to the “Microsoft Entra Authentication Redis user” list for the Azure Managed Redis instance

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

```yaml
config:
  vectordb:
    strategy: redis
    redis:
      host: $INSTANCE_ADDRESS
      username: $INSTANCE_USERNAME
      port: 10000
      cloud_authentication:
        auth_provider: azure
        azure_client_id: $AZURE_CLIENT_ID
        azure_client_secret: $AZURE_CLIENT_SECRET
        azure_tenant_id: $AZURE_TENANT_ID
```

Replace the following with your actual values:
- $INSTANCE_ADDRESS: The Azure Managed Redis instance address.
- $INSTANCE_USERNAME: The object (principal) ID of the Principal/Identity with essential access.
- $AZURE_CLIENT_ID: The client ID of the Principal/Identity.
- $AZURE_CLIENT_SECRET: (Optional) The client secret of the Principal/Identity.
- $AZURE_TENANT_ID: (Optional) The tenant ID of the Principal/Identity.
For an Azure Managed Redis cluster, you need:
- A running Redis instance on an Azure Managed Redis cluster with Entra authentication configured
- The user, service principal, or identity added to the “Microsoft Entra Authentication Redis user” list for the Azure Managed Redis instance

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

```yaml
config:
  vectordb:
    strategy: redis
    redis:
      cluster_nodes:
      - ip: $CLUSTER_ADDRESS
        port: 10000
      username: $CLUSTER_USERNAME
      port: 10000
      cloud_authentication:
        auth_provider: azure
        azure_client_id: $AZURE_CLIENT_ID
        azure_client_secret: $AZURE_CLIENT_SECRET
        azure_tenant_id: $AZURE_TENANT_ID
```

Replace the following with your actual values:
- $CLUSTER_ADDRESS: The Azure Managed Redis cluster address.
- $CLUSTER_USERNAME: The object (principal) ID of the Principal/Identity with essential access.
- $AZURE_CLIENT_ID: The client ID of the Principal/Identity.
- $AZURE_CLIENT_SECRET: (Optional) The client secret of the Principal/Identity.
- $AZURE_TENANT_ID: (Optional) The tenant ID of the Principal/Identity.
For a Google Cloud Memorystore instance, you need:
- A running Redis instance on a Google Cloud Memorystore instance
- The principal assigned to the corresponding role:
  - Cloud Memorystore Redis DB Connection User (roles/redis.dbConnectionUser) for Memorystore for Redis Cluster
  - Memorystore DB Connector User (roles/memorystore.dbConnectionUser) for Memorystore for Valkey

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

```yaml
config:
  vectordb:
    strategy: redis
    redis:
      host: $INSTANCE_ADDRESS
      port: 6379
      cloud_authentication:
        auth_provider: gcp
        gcp_service_account_json: $GCP_SERVICE_ACCOUNT
```

Replace the following with your actual values:
- $INSTANCE_ADDRESS: The Memorystore instance address.
- $GCP_SERVICE_ACCOUNT: (Optional) The GCP service account JSON.
For a Google Cloud Memorystore cluster, you need:
- A running Redis instance on a Google Cloud Memorystore cluster
- The principal assigned to the corresponding role:
  - Cloud Memorystore Redis DB Connection User (roles/redis.dbConnectionUser) for Memorystore for Redis Cluster
  - Memorystore DB Connector User (roles/memorystore.dbConnectionUser) for Memorystore for Valkey

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

```yaml
config:
  vectordb:
    strategy: redis
    redis:
      cluster_nodes:
      - ip: $CLUSTER_ADDRESS
        port: 6379
      port: 6379
      cloud_authentication:
        auth_provider: gcp
        gcp_service_account_json: $GCP_SERVICE_ACCOUNT
```

Replace the following with your actual values:
- $CLUSTER_ADDRESS: The Memorystore cluster address.
- $GCP_SERVICE_ACCOUNT: The GCP service account JSON.
FAQs
Can I override config.targets[].model.name by specifying a different model name in the request?
By default, no. The model name must match the one configured in config.targets[].model.name. If a different model is specified in the request, the plugin returns a 400 error.
However, if you set model_alias on a target, clients can send the alias value in the model field instead of the actual provider model name. The plugin matches the request to the target with the corresponding alias. See Route requests to different models using model aliases for an example.
Can I override temperature, top_p, and top_k from the request?
Yes. The values for temperature, top_p, and top_k in the request take precedence over those set in config.targets[].model.options.
Can I override authentication values from the request?
Yes, but only if config.targets[].auth.allow_override is set to true in the plugin configuration.
When enabled, this allows request-level auth parameters (such as API keys or bearer tokens) to override the static values defined in the plugin.
What algorithm does ai-proxy-advanced use for selecting the lowest latency target?
It uses Kong’s built-in load balancing mechanism with the EWMA (Exponentially Weighted Moving Average) algorithm to dynamically route traffic to the backend with the lowest observed latency.
What is the duration of the learning phase with AI Proxy Advanced?
There’s no fixed time window. EWMA continuously updates with every response, giving more weight to recent observations. Older latencies decay over time, but still contribute in smaller proportions.
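The decay described above follows from the textbook form of an exponentially weighted moving average. The recurrence below is a general sketch, not Kong's exact implementation: $x_t$ is the latency observed for response $t$, $s_t$ is the running score, and the smoothing factor $\alpha$ is an assumed symbol whose value this page does not state.

```latex
s_t = \alpha\, x_t + (1 - \alpha)\, s_{t-1}, \qquad 0 < \alpha \le 1
% Unrolling the recurrence shows the weight of each past observation:
s_t = \alpha \sum_{k=0}^{t-1} (1 - \alpha)^{k}\, x_{t-k} + (1 - \alpha)^{t}\, s_0
```

An observation that is $k$ responses old carries weight $\alpha (1-\alpha)^k$, so its influence shrinks geometrically but never reaches zero.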
How does AI Proxy Advanced distribute traffic once a faster model is identified?
The fastest model gets a majority of traffic, but Kong never sends 100% to a single target unless it’s the only one available. In practice, the dominant target may receive ~90–99% of traffic, depending on how much better its EWMA score is.
Does the system continue testing other targets when the AI Proxy Advanced plugin identifies the fastest model?
Yes. EWMA ensures all targets continue to receive a small amount of traffic. This ongoing probing lets the system adapt if a previously slower model becomes faster later.
What’s the approximate percentage of traffic sent to non-dominant targets with AI Proxy Advanced?
While exact percentages vary with latency gaps, less performant targets typically get between 0.1%–5% of traffic, just enough to keep updating their EWMA score for comparison.
How do I resolve the MemoryDB error Number of indexes exceeds the limit?
If you see the following error in the logs:
```
failed to create memorydb instance failed to create index: LIMIT Number of indexes (11) exceeds the limit (10)
```

This means that the hardcoded MemoryDB instance limit has been reached. To resolve this, create more MemoryDB instances to handle multiple AI Proxy Advanced plugin instances.
