AI Proxy Advanced

AI License Required

Overview of capabilities

AI Proxy Advanced plugin supports capabilities across batch processing, multimodal embeddings, agents, audio, image, streaming, and more, spanning multiple providers.

For Kong Gateway versions 3.6 or earlier:

Chat Completions APIs: Multi-turn conversations with system/user/assistant roles.
Completions API: Generates free-form text from a prompt.

OpenAI has marked this endpoint as legacy and recommends using the Chat Completions API for developing new applications.

See the following table for capabilities supported in AI Gateway:

API capability	Description	Examples	OpenAI format
Chat completions	Generates conversational responses from a sequence of messages using supported LLM providers.	`llm/v1/chat`	Supported
Embeddings	Converts text to vector representations for semantic search and similarity matching.	`llm/v1/embeddings`	Supported
Function calling	Allows models to invoke external tools and APIs based on conversation context.	“`llm/v1/chat`”	Supported
Assistants and responses	Powers persistent tool-using agents and exposes metadata for debugging and evaluation.	`llm/v1/assistants` `llm/v1/responses`	Supported
Batches and files	Supports asynchronous bulk LLM requests and file uploads for long documents and structured input.	`llm/v1/batches` `llm/v1/files` Send asynchronous requests to LLMs	Supported
Audio	Enables speech-to-text, text-to-speech, and translation for voice applications.	`audio/v1/audio/transcriptions` `audio/v1/audio/speech` `audio/v1/audio/translations`	Supported
Image generation and editing	Generates or modifies images from text prompts.	`image/v1/images/generations` `image/v1/images/edits`	Supported
Video generation	Generates videos from text prompts.	`video/v1/videos/generations`	Supported
Realtime	Bidirectional WebSocket streaming for low-latency, interactive voice and text applications.	`realtime/v1/realtime`	Supported
AWS Bedrock native APIs	Enables advanced orchestration and real-time RAG via Converse and RetrieveAndGenerate endpoints. Available only when using native LLM format for Bedrock.	`/converse` `/retrieveAndGenerate`	Not supported
Hugging Face native APIs	Provides text generation and streaming using Hugging Face models. Available only when using native LLM format for Hugging Face.	`/generate`	Not supported
Rerank	Reorders documents by relevance for RAG pipelines using Bedrock or Cohere rerank APIs. Available only when using native LLM format for Bedrock and Cohere.	`/rerank`	Not supported

The following providers are supported by the legacy Completions API:

OpenAI

Azure OpenAI

Cohere

Llama2

Amazon Bedrock

Gemini

Hugging Face

Supported AI providers

AI Gateway supports proxying requests to the following AI providers. Each provider page documents supported capabilities, configuration requirements, and provider-specific details.

For detailed capability support, configuration requirements, and provider-specific limitations, see the individual provider reference pages.

👁 Image
OpenAI

👁 Image
Azure OpenAI

👁 Image
Amazon Bedrock

👁 Image
Anthropic

👁 Image
Gemini

👁 Image
Vertex AI

👁 Image
Cohere

👁 Image
Mistral

👁 Image
Hugging Face

👁 Image
Llama

👁 Image
xAI

👁 Image
Alibaba Cloud DashScope

👁 Image
Cerebras

👁 Image
DeepSeek

👁 Image
Ollama

👁 Image
Databricks

👁 Image
vLLM

How it works

The AI Proxy Advanced plugin will mediate the following for you:

Request and response formats appropriate for the configured config.targets[].model.provider and config.targets[].route_type
The following service request coordinates (unless the model is self-hosted):
- Protocol
- Host name
- Port
- Path
- HTTP method
Authentication on behalf of the Kong API consumer
Decorating the request with parameters from the config.targets.model[].options block, appropriate for the chosen provider
Recording of usage statistics of the configured LLM provider and model into your selected Kong log plugin output
Optionally, additionally recording all post-transformation request and response messages from users, to and from the configured LLM
Fulfillment of requests to self-hosted models, based on select supported format transformations

Flattening all of the provider formats allows you to standardize the manipulation of the data before and after transmission. It also allows your to provide a choice of LLMs to the Kong Gateway Consumers, using consistent request and response formats, regardless of the backend provider or model.

v3.11+ AI Proxy Advanced supports REST-based full-text responses, including RESTful endpoints such as llm/v1/responses, llm/v1/files, llm/v1/assisstants and llm/v1/batches. RESTful endpoints support CRUD operations— you can POST to create a response, GET to retrieve it, or DELETE to remove it.

Request and response formats

AI Gateway transforms requests and responses according to the configured config.targets[].model.provider and config.targets[].route_type, using the OpenAI format by default. v3.10+ To use a provider’s native format instead, set config.llm_format to a value other than openai. The plugin then passes requests upstream without transformation. See Supported native LLM formats for available options.

The following table maps each route type to its OpenAI API reference and generative AI category. See the AI provider reference pages for provider-specific details.

Route type	OpenAI API reference	Gen AI category	Min version
`llm/v1/chat`	Chat completions	`text/generation`	3.6
`llm/v1/completions`	Completions	`text/generation`	3.6
`llm/v1/embeddings`	Embeddings	`text/embeddings`	3.11
`llm/v1/files`	Files	N/A	3.11
`llm/v1/batches`	Batch	N/A	3.11
`llm/v1/assistants`	Assistants	`text/generation`	3.11
`llm/v1/responses`	Responses	`text/generation`	3.11
`realtime/v1/realtime`	Realtime	`realtime/generation`	3.11
`audio/v1/audio/speech`	Create speech	`audio/speech`	3.11
`audio/v1/audio/transcriptions`	Create transcription	`audio/transcription`	3.11
`audio/v1/audio/translations`	Create translation	`audio/transcription`	3.11
`image/v1/images/generations`	Create image	`image/generation`	3.11
`image/v1/images/edits`	Create image edit	`image/generation`	3.11
`video/v1/videos/generations`	Create video	`video/generation`	3.13

Provider-specific parameters can be passed using the extra_body field in your request. See the sample OpenAPI specification for detailed format examples.

Supported native LLM formats v3.10+

If you use a provider’s native SDK, AI Gateway v3.10+ can proxy the request and return the upstream response without payload format conversion. Set config.llm_format to a value other than openai to preserve the provider’s native request and response formats.

In this mode, AI Gateway will still provide analytics, logging, and cost calculation. When config.llm_format is set to a native format, only the corresponding provider is supported with its specific APIs.

Provider	LLM format	Native capabilities
Anthropic	`anthropic`	Messages, batch processing
Amazon Bedrock	`bedrock`	Converse, RAG (RetrieveAndGenerate), reranking, async invocation
Cohere	`cohere`	Reranking
Gemini	`gemini`	Content generation, embeddings, batches, file uploads
Vertex AI	`gemini`	Content generation, embeddings, batches, reranking, long-running predictions
Hugging Face	`huggingface`	Text generation, streaming

Load balancing

AI Proxy Advanced supports several load balancing algorithms for distributing requests across AI models:

Round-robin: Weighted traffic distribution.
Consistent-hashing: Sticky sessions based on header values.
Least-connections: Route to backends with spare capacity.
Lowest-latency: Route to fastest-responding models.
Lowest-usage: Route based on token counts or cost.
Semantic: Route based on prompt-to-model similarity.
Priority: Tiered failover across model groups.

For detailed algorithm descriptions and selection guidance, see Load balancing algorithms.

For load balancing across Gateway Upstreams and Targets instead of LLMs, see load balancing with Kong Gateway.

Retry and fallback

The AI load balancer supports configurable retries, timeouts, and failover to different models when a target is unavailable.

v3.10+ Fallback works across targets with any supported format. You can mix providers freely, for example OpenAI and Mistral. Earlier versions require compatible formats between fallback targets. For configuration details, see Retry and fallback configuration.

Client errors don’t trigger failover. To failover on additional error types, set config.balancer.failover_criteria to include HTTP codes like http_429 or http_502, and non_idempotent for POST requests.

Health check and circuit breaker v3.13+

The AI load balancer supports circuit breakers to improve reliability. If a target reaches the failure threshold defined by config.balancer.max_fails, the load balancer stops routing requests to it until the timeout period (config.balancer.fail_timeout) elapses.

For configuration details and behavior examples, see Circuit breaker.

Templating v3.7+

The plugin allows you to substitute values in the config.targets[].model.name and any parameter under config.targets.model[].options with specific placeholders, similar to those in the Request Transformer Advanced plugin.

The following templated parameters are available:

$(headers.header_name): The value of a specific request header.
$(uri_captures.path_parameter_name): The value of a captured URI path parameter.
$(query_params.query_parameter_name): The value of a query string parameter.

You can combine these parameters with an OpenAI-compatible SDK in multiple ways using the AI Proxy and AI Proxy Advanced plugins, depending on your specific use case:

Action	Description
Select different models dynamically on one provider	Allow users to select the target model based on a request header or parameter. Supports flexible routing across different models on the same provider.
Use one chat route with dynamic Azure OpenAI deployments	Configure a dynamic route to target multiple Azure OpenAI model deployments.
Use multiple routes to map mulitple Azure Deployment	Use separate Routes to map Azure OpenAI SDK requests to specific deployments of GPT-3.5 and GPT-4.

Vector databases

A vector database can be used to store vector embeddings, or numerical representations, of data items. For example, a response would be converted to a numerical representation and stored in the vector database so that it can compare new requests against the stored vectors to find relevant cached items.

The AI Proxy Advanced plugin supports the following vector databases:

Using config.vectordb.strategy: redis and parameters in config.vectordb.redis:
- Redis with Vector Similarity Search (VSS)
- Redis Cloud
- Valkey v3.14+: When you configure vectordb.strategy: redis, Kong Gateway queries the server and checks the server name field. If it detects Valkey request, it automatically uses the Valkey-specific driver.
- Managed Redis with cloud authentication:
  - AWS ElastiCache (auth_provider: aws)
  - Azure Managed Redis (auth_provider: azure)
  - Google Cloud Memorystore (auth_provider: gcp)
  For configuration details, see Using cloud authentication with Redis.
Using config.vectordb.strategy: pgvector and parameters in config.vectordb.pgvector:
- PostgreSQL with pgvector v3.10+

To learn more about vector databases in AI Gateway, see Embedding-based similarity matching in Kong AI gateway plugins.

Partials v3.13+

This plugin supports all three AI Partial types, which let you define shared configuration once and reuse it across multiple AI Gateway plugins.

Partial type	Fields covered
`vectordb`	`config.vectordb`
`embeddings`	`config.embeddings`
`model`	Each element of `config.targets[]`

A model Partial applies to each entry in the config.targets array, so you can share one provider configuration across multiple targets.

For setup instructions, see AI plugin Partials.

Using cloud authentication with Redis v3.13+

If your plugin uses a Redis datastore, you can authenticate to it with a cloud Redis provider. This allows you to seamlessly rotate credentials without relying on static passwords.

The following providers are supported:

AWS ElastiCache
Azure Managed Redis
Google Cloud Memorystore (with or without Valkey)

You need:

A running Redis instance on an AWS ElastiCache instance for Valkey 7.2 or later or ElastiCache for Redis OSS version 7.0 or later
The ElastiCache user needs to set “Authentication mode” to “IAM”

The following policy assigned to the IAM user/IAM role that is used to connect to the ElastiCache:

{
 "Version": "2012-10-17",
 "Statement": [
 {
 "Effect": "Allow",
 "Action": [
 "elasticache:Connect"
 ],
 "Resource": [
 "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE",
 "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE_USER"
 ]
 }
 ]
}

Copied!

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

config:
 vectordb:
 strategy: redis
 redis:
 host: $INSTANCE_ADDRESS
 username: $INSTANCE_USERNAME
 port: 6379
 cloud_authentication:
 auth_provider: aws
 aws_cache_name: $AWS_CACHE_NAME
 aws_is_serverless: false
 aws_region: $AWS_REGION
 aws_access_key_id: $AWS_ACCESS_KEY_ID
 aws_secret_access_key: $AWS_ACCESS_SECRET_KEY

Copied!

Replace the following with your actual values:

$INSTANCE_ADDRESS: The ElastiCache instance address.
$INSTANCE_USERNAME: The ElastiCache username with IAM Auth mode configured.
$AWS_CACHE_NAME: Name of your AWS ElastiCache instance.
$AWS_REGION: Your AWS ElastiCache instance region.
$AWS_ACCESS_KEY_ID: (Optional) Your AWS access key ID.
$AWS_ACCESS_SECRET_KEY: (Optional) Your AWS secret access key.

You need:

A running Redis instance on an AWS ElastiCache cluster for Valkey 7.2 or later or ElastiCache for Redis OSS version 7.0 or later
The ElastiCache user needs to set “Authentication mode” to “IAM”

The following policy assigned to the IAM user/IAM role that is used to connect to the ElastiCache:

{
 "Version": "2012-10-17",
 "Statement": [
 {
 "Effect": "Allow",
 "Action": [
 "elasticache:Connect"
 ],
 "Resource": [
 "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE",
 "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE_USER"
 ]
 }
 ]
}

Copied!

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

config:
 vectordb:
 strategy: redis
 redis:
 cluster_nodes:
 - ip: $CLUSTER_ADDRESS
 port: 6379
 username: $CLUSTER_USERNAME
 port: 6379
 cloud_authentication:
 auth_provider: aws
 aws_cache_name: $AWS_CACHE_NAME
 aws_is_serverless: false
 aws_region: $AWS_REGION
 aws_access_key_id: $AWS_ACCESS_KEY_ID
 aws_secret_access_key: $AWS_ACCESS_SECRET_KEY

Copied!

Replace the following with your actual values:

$CLUSTER_ADDRESS: The ElastiCache cluster address.
$CLUSTER_USERNAME: The ElastiCache username with IAM Auth mode configured.
$AWS_CACHE_NAME: Name of your AWS ElastiCache cluster.
$AWS_REGION: Your AWS ElastiCache cluster region.
$AWS_ACCESS_KEY_ID: (Optional) Your AWS access key ID.
$AWS_ACCESS_SECRET_KEY: (Optional) Your AWS secret access key.

You need:

A running Redis instance on an Azure Managed Redis instance with Entra authentication configured
Add the user/service principal/identity to the “Microsoft Entra Authentication Redis user” list for the Azure Managed Redis instance

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

config:
 vectordb:
 strategy: redis
 redis:
 host: $INSTANCE_ADDRESS
 username: $INSTANCE_USERNAME
 port: 10000
 cloud_authentication:
 auth_provider: azure
 azure_client_id: $AZURE_CLIENT_ID
 azure_client_secret: $AZURE_CLIENT_SECRET
 azure_tenant_id: $AZURE_TENANT_ID

Copied!

Replace the following with your actual values:

$INSTANCE_ADDRESS: The Azure Managed Redis instance address.
$INSTANCE_USERNAME: The object (principal) ID of the Principal/Identity with essential access.
$AZURE_CLIENT_ID: The client ID of the Principal/Identity.
$AZURE_CLIENT_SECRET: (Optional) The client secret of the Principal/Identity.
$AZURE_TENANT_ID: (Optional) The tenant ID of the Principal/Identity.

You need:

A running Redis instance on an Azure Managed Redis cluster with Entra authentication configured
Add the user/service principal/identity to the “Microsoft Entra Authentication Redis user” list for the Azure Managed Redis instance

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

config:
 vectordb:
 strategy: redis
 redis:
 cluster_nodes:
 - ip: $CLUSTER_ADDRESS
 port: 10000
 username: $CLUSTER_USERNAME
 port: 10000
 cloud_authentication:
 auth_provider: azure
 azure_client_id: $AZURE_CLIENT_ID
 azure_client_secret: $AZURE_CLIENT_SECRET
 azure_tenant_id: $AZURE_TENANT_ID

Copied!

Replace the following with your actual values:

$CLUSTER_ADDRESS: The Azure Managed Redis cluster address.
$CLUSTER_USERNAME: The object (principal) ID of the Principal/Identity with essential access.
$AZURE_CLIENT_ID: The client ID of the Principal/Identity.
$AZURE_CLIENT_SECRET: (Optional) The client secret of the Principal/Identity.
$AZURE_TENANT_ID: (Optional) The tenant ID of the Principal/Identity.

You need:

A running Redis instance on an Google Cloud Memorystore instance
Assign the principal to the corresponding role:
- Cloud Memorystore Redis DB Connection User(roles/redis.dbConnectionUser) for Memorystore for Redis Cluster
- Memorystore DB Connector User (roles/memorystore.dbConnectionUser) for Memorystore for Valkey

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

config:
 vectordb:
 strategy: redis
 redis:
 host: $INSTANCE_ADDRESS
 port: 6379
 cloud_authentication:
 auth_provider: gcp
 gcp_service_account_json: $GCP_SERVICE_ACCOUNT

Copied!

Replace the following with your actual values:

$INSTANCE_ADDRESS: The Memorystore instance address.
$GCP_SERVICE_ACCOUNT: (Optional) The GCP service account JSON.

You need:

A running Redis instance on an Google Cloud Memorystore cluster
Assign the principal to the corresponding role:
- Cloud Memorystore Redis DB Connection User(roles/redis.dbConnectionUser) for Memorystore for Redis Cluster
- Memorystore DB Connector User (roles/memorystore.dbConnectionUser) for Memorystore for Valkey

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

config:
 vectordb:
 strategy: redis
 redis:
 cluster_nodes:
 - ip: $CLUSTER_ADDRESS
 port: 6379
 port: 6379
 cloud_authentication:
 auth_provider: gcp
 gcp_service_account_json: $GCP_SERVICE_ACCOUNT

Copied!

Replace the following with your actual values:

$CLUSTER_ADDRESS: The Memorystore cluster address.
$GCP_SERVICE_ACCOUNT: The GCP service account JSON.

FAQs

Can I override config.model.name by specifying a different model name in the request?

By default, no. The model name must match the one configured in config.model.name. If a different model is specified in the request, the plugin returns a 400 error.

However, if you set model_alias on a target, clients can send the alias value in the model field instead of the actual provider model name. The plugin matches the request to the target with the corresponding alias. See Route requests to different models using model aliases for an example.

Can I override temperature, top_p, and top_k from the request?

Yes. The values for temperature, top_p, and top_k in the request take precedence over those set in config.targets.model.options.

Can I override authentication values from the request?

Yes, but only if config.targets.auth.allow_override is set to true in the plugin configuration. When enabled, this allows request-level auth parameters (such as API keys or bearer tokens) to override the static values defined in the plugin.

What algorithm does ai-proxy-advanced use for selecting the lowest latency target?

It uses Kong’s built-in load balancing mechanism with the EWMA (Exponentially Weighted Moving Average) algorithm to dynamically route traffic to the backend with the lowest observed latency.

What is the duration of the learning phase with AI Proxy Advanced?

There’s no fixed time window. EWMA continuously updates with every response, giving more weight to recent observations. Older latencies decay over time, but still contribute in smaller proportions.

How does AI Proxy Advanced distribute traffic once a faster model is identified?

The fastest model gets a majority of traffic, but Kong never sends 100% to a single target unless it’s the only one available. In practice, the dominant target may receive ~90–99% of traffic, depending on how much better its EWMA score is.

Does the system continue testing other targets when the AI Proxy Advanced plugin identifies the fastest model?

Yes. EWMA ensures all targets continue to receive a small amount of traffic. This ongoing probing lets the system adapt if a previously slower model becomes faster later.

What’s the approximate percentage of traffic sent to non-dominant targets with AI Proxy Advanced?

While exact percentages vary with latency gaps, less performant targets typically get between 0.1%–5% of traffic, just enough to keep updating their EWMA score for comparison.

How do I resolve the MemoryDB error Number of indexes exceeds the limit?

If you see the following error in the logs:

failed to create memorydb instance failed to create index: LIMIT Number of indexes (11) exceeds the limit (10)

Copied!

This means that the hardcoded MemoryDB instance limit has been reached. To resolve this, create more MemoryDB instances to handle multiple AI Proxy Advanced plugin instances.

URL: https://developer.konghq.com/plugins/ai-proxy-advanced/