AI Gateway Enterprise: This plugin is only available as part of our AI Gateway Enterprise offering.
The AI Semantic Cache plugin stores user requests to an LLM in a vector database based on semantic meaning. When a similar query is made, it uses these embeddings to retrieve relevant cached requests efficiently.
What is semantic caching?
Semantic caching enhances data retrieval efficiency by focusing on the meaning or context of queries rather than just exact matches. It stores requests based on the underlying intent and semantic similarities between different queries and can then retrieve those cached queries when a similar request is made.
When a new request is made, the system can retrieve and reuse previously cached requests if they are contextually relevant, even if the phrasing is different. This method reduces redundant processing, speeds up response times, and ensures that answers are more relevant to the user’s intent, ultimately improving overall system performance and user experience.
For example, if a user asks, “how to integrate our API with a mobile app” and later asks, “what are the steps for connecting our API to a smartphone application?”, the system understands that both questions are asking for the same information. It can then retrieve and reuse previously cached responses, even if the wording is different. This approach reduces processing time and speeds up responses.
The AI Semantic Cache plugin may not be ideal if the following are true:
- You have limited hardware or budget. Storing semantic vectors and running similarity searches require a lot of storage and computing power, which could be an issue.
- Your data doesn’t rely on semantics, or exact matches work fine. In this case, semantic caching may offer little benefit. Traditional or keyword-based caching might be more efficient.
How it works
Semantic caching with the AI Semantic Cache plugin involves three parts: request handling, embedding generation, and response caching.
First, a user starts a chat request with the LLM. The AI Semantic Cache plugin queries the vector database to see if there are any semantically similar requests that have already been cached. If there is a match, the vector database returns the cached response to the user.
sequenceDiagram actor User participant Kong Gateway/AI Semantic Cache plugin participant Vector database User->>Kong Gateway/AI Semantic Cache plugin: LLM chat request Kong Gateway/AI Semantic Cache plugin->>Vector database: Query for semantically similar previous requests Vector database-->>User: If response, return it or stream it back
If there isn’t a match, the AI Semantic Cache plugin prompts the embeddings LLM to generate an embedding for the response.
sequenceDiagram participant Kong Gateway/AI Semantic Cache plugin participant Embeddings LLM Kong Gateway/AI Semantic Cache plugin->>Embeddings LLM: Generate embeddings for `config.message_countback` messages Embeddings LLM-->>Kong Gateway/AI Semantic Cache plugin: Return embeddings
The AI Semantic Cache plugin uses a vector database and cache to store responses to requests. The plugin can then retrieve a cached response if a new request matches the semantics of a previous request, or it can tell the vector database to store a new response if there are no matches.
sequenceDiagram participant Kong Gateway/AI Semantic Cache plugin participant Prompt/Chat LLM participant Vector database actor User Kong Gateway/AI Semantic Cache plugin->>Prompt/Chat LLM: Make LLM request Prompt/Chat LLM-->>Kong Gateway/AI Semantic Cache plugin: Receive response Kong Gateway/AI Semantic Cache plugin->>Vector database: Store vectors Kong Gateway/AI Semantic Cache plugin->>Vector database: Store response message options Kong Gateway/AI Semantic Cache plugin-->>User: Return realtime response
Cache management
With the AI Semantic Cache plugin, you can configure a cache of your choice to store the responses from the LLM.
The AI Semantic Cache plugin supports Redis as a cache.
Caching mechanisms
The AI Semantic Cache plugin improves how AI systems provide responses by using two kinds of caching mechanisms:
- Exact Caching: This stores precise, unaltered responses for specific queries. If a user asks the same question multiple times, the system can quickly retrieve the pre-stored response rather than generating it again each time. This speeds up response times and reduces computational load.
- Semantic Caching: This approach is more flexible and involves storing responses based on the meaning or intent behind the queries. Instead of relying on exact matches, the system can understand and reuse information that is conceptually similar. For instance, if a user asks about “Italian restaurants in New York City” and later about “New York City Italian cuisine,” semantic caching can help provide relevant information based on their related meanings.
Together, these caching methods enhance the efficiency and relevance of AI responses, making interactions faster and more contextually accurate.
When Exact Caching is enabled, the AI Semantic Cache plugin may still return results for queries that are similar but not identical. This is expected behavior: the plugin performs similarity-based caching regardless of the Exact Caching setting.
Headers sent to the client
When the AI Semantic Cache plugin is active, Kong Gateway sends additional headers indicating the cache status and other relevant information:
X-Cache-Status: Hit
X-Cache-Status: Miss
X-Cache-Status: Bypass
X-Cache-Status: Refresh
X-Cache-Key: <cache_key>
X-Cache-Ttl: <ttl>
Age: <age>These headers help clients understand whether a response was served from the cache, if the cache key was used, the remaining time-to-live, and the age of the cached response.
Cache control headers
The plugin respects cache control headers to determine if requests and responses should be cached or not. It supports the following directives:
-
no-store: Prevents caching of the request or response -
no-cache: Forces validation with the origin server before serving the cached response -
private: Ensures the response is not cached by shared caches -
max-ageands-maxage: Sets the maximum age of the cached response. This causes the vector database to drop and delete the cached response message after expiration, so it’s never seen again.
As most AI services always send
no-cachein the response headers, settingcache_controltotruewill always result in a cache bypass. Only consider settingno-cacheif you are using self-hosted services and have control over the response Cache Control headers.
Partials v3.13+
This plugin supports vectordb and embeddings Partials, which let you define shared vector database and embeddings configuration once and reuse it across multiple AI Gateway plugins. This is useful when running this plugin alongside others that use the same vector database and embeddings model, such as AI Semantic Cache, AI RAG Injector, AI Semantic Prompt Guard, AI Semantic Response Guard, and AI Proxy Advanced.
|
Partial type |
Fields covered |
|---|---|
vectordb
|
config.vectordb
|
embeddings
|
config.embeddings
|
For setup instructions, see AI plugin Partials.
Vector databases
A vector database can be used to store vector embeddings, or numerical representations, of data items. For example, a response would be converted to a numerical representation and stored in the vector database so that it can compare new requests against the stored vectors to find relevant cached items.
The AI Semantic Cache plugin supports the following vector databases:
- Using
config.vectordb.strategy: redisand parameters inconfig.vectordb.redis:- Redis with Vector Similarity Search (VSS)
- Redis Cloud
-
Valkey v3.14+: When you configure
vectordb.strategy: redis, Kong Gateway queries the server and checks the server name field. If it detects Valkey request, it automatically uses the Valkey-specific driver. - Managed Redis with cloud authentication:
-
AWS ElastiCache (
auth_provider: aws) -
Azure Managed Redis (
auth_provider: azure) -
Google Cloud Memorystore (
auth_provider: gcp)
For configuration details, see Using cloud authentication with Redis.
-
AWS ElastiCache (
- Using
config.vectordb.strategy: pgvectorand parameters inconfig.vectordb.pgvector:- PostgreSQL with pgvector v3.10+
To learn more about vector databases in AI Gateway, see Embedding-based similarity matching in Kong AI gateway plugins.
Using cloud authentication with Redis v3.13+
If your plugin uses a Redis datastore, you can authenticate to it with a cloud Redis provider. This allows you to seamlessly rotate credentials without relying on static passwords.
The following providers are supported:
- AWS ElastiCache
- Azure Managed Redis
- Google Cloud Memorystore (with or without Valkey)
You need:
- A running Redis instance on an AWS ElastiCache instance for Valkey 7.2 or later or ElastiCache for Redis OSS version 7.0 or later
- The ElastiCache user needs to set “Authentication mode” to “IAM”
- The following policy assigned to the IAM user/IAM role that is used to connect to the ElastiCache:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "elasticache:Connect" ], "Resource": [ "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE", "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE_USER" ] } ] }Copied!
To configure cloud authentication with Redis, add the following parameters to your plugin configuration:
config:
vectordb:
strategy: redis
redis:
host: $INSTANCE_ADDRESS
username: $INSTANCE_USERNAME
port: 6379
cloud_authentication:
auth_provider: aws
aws_cache_name: $AWS_CACHE_NAME
aws_is_serverless: false
aws_region: $AWS_REGION
aws_access_key_id: $AWS_ACCESS_KEY_ID
aws_secret_access_key: $AWS_ACCESS_SECRET_KEYReplace the following with your actual values:
-
$INSTANCE_ADDRESS: The ElastiCache instance address. -
$INSTANCE_USERNAME: The ElastiCache username with IAM Auth mode configured. -
$AWS_CACHE_NAME: Name of your AWS ElastiCache instance. -
$AWS_REGION: Your AWS ElastiCache instance region. -
$AWS_ACCESS_KEY_ID: (Optional) Your AWS access key ID. -
$AWS_ACCESS_SECRET_KEY: (Optional) Your AWS secret access key.
You need:
- A running Redis instance on an AWS ElastiCache cluster for Valkey 7.2 or later or ElastiCache for Redis OSS version 7.0 or later
- The ElastiCache user needs to set “Authentication mode” to “IAM”
- The following policy assigned to the IAM user/IAM role that is used to connect to the ElastiCache:
{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "elasticache:Connect" ], "Resource": [ "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE", "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE_USER" ] } ] }Copied!
To configure cloud authentication with Redis, add the following parameters to your plugin configuration:
config:
vectordb:
strategy: redis
redis:
cluster_nodes:
- ip: $CLUSTER_ADDRESS
port: 6379
username: $CLUSTER_USERNAME
port: 6379
cloud_authentication:
auth_provider: aws
aws_cache_name: $AWS_CACHE_NAME
aws_is_serverless: false
aws_region: $AWS_REGION
aws_access_key_id: $AWS_ACCESS_KEY_ID
aws_secret_access_key: $AWS_ACCESS_SECRET_KEYReplace the following with your actual values:
-
$CLUSTER_ADDRESS: The ElastiCache cluster address. -
$CLUSTER_USERNAME: The ElastiCache username with IAM Auth mode configured. -
$AWS_CACHE_NAME: Name of your AWS ElastiCache cluster. -
$AWS_REGION: Your AWS ElastiCache cluster region. -
$AWS_ACCESS_KEY_ID: (Optional) Your AWS access key ID. -
$AWS_ACCESS_SECRET_KEY: (Optional) Your AWS secret access key.
You need:
- A running Redis instance on an Azure Managed Redis instance with Entra authentication configured
- Add the user/service principal/identity to the “Microsoft Entra Authentication Redis user” list for the Azure Managed Redis instance
To configure cloud authentication with Redis, add the following parameters to your plugin configuration:
config:
vectordb:
strategy: redis
redis:
host: $INSTANCE_ADDRESS
username: $INSTANCE_USERNAME
port: 10000
cloud_authentication:
auth_provider: azure
azure_client_id: $AZURE_CLIENT_ID
azure_client_secret: $AZURE_CLIENT_SECRET
azure_tenant_id: $AZURE_TENANT_IDReplace the following with your actual values:
-
$INSTANCE_ADDRESS: The Azure Managed Redis instance address. -
$INSTANCE_USERNAME: The object (principal) ID of the Principal/Identity with essential access. -
$AZURE_CLIENT_ID: The client ID of the Principal/Identity. -
$AZURE_CLIENT_SECRET: (Optional) The client secret of the Principal/Identity. -
$AZURE_TENANT_ID: (Optional) The tenant ID of the Principal/Identity.
You need:
- A running Redis instance on an Azure Managed Redis cluster with Entra authentication configured
- Add the user/service principal/identity to the “Microsoft Entra Authentication Redis user” list for the Azure Managed Redis instance
To configure cloud authentication with Redis, add the following parameters to your plugin configuration:
config:
vectordb:
strategy: redis
redis:
cluster_nodes:
- ip: $CLUSTER_ADDRESS
port: 10000
username: $CLUSTER_USERNAME
port: 10000
cloud_authentication:
auth_provider: azure
azure_client_id: $AZURE_CLIENT_ID
azure_client_secret: $AZURE_CLIENT_SECRET
azure_tenant_id: $AZURE_TENANT_IDReplace the following with your actual values:
-
$CLUSTER_ADDRESS: The Azure Managed Redis cluster address. -
$CLUSTER_USERNAME: The object (principal) ID of the Principal/Identity with essential access. -
$AZURE_CLIENT_ID: The client ID of the Principal/Identity. -
$AZURE_CLIENT_SECRET: (Optional) The client secret of the Principal/Identity. -
$AZURE_TENANT_ID: (Optional) The tenant ID of the Principal/Identity.
You need:
- A running Redis instance on an Google Cloud Memorystore instance
- Assign the principal to the corresponding role:
-
Cloud Memorystore Redis DB Connection User(
roles/redis.dbConnectionUser) for Memorystore for Redis Cluster -
Memorystore DB Connector User (
roles/memorystore.dbConnectionUser) for Memorystore for Valkey
-
Cloud Memorystore Redis DB Connection User(
To configure cloud authentication with Redis, add the following parameters to your plugin configuration:
config:
vectordb:
strategy: redis
redis:
host: $INSTANCE_ADDRESS
port: 6379
cloud_authentication:
auth_provider: gcp
gcp_service_account_json: $GCP_SERVICE_ACCOUNTReplace the following with your actual values:
-
$INSTANCE_ADDRESS: The Memorystore instance address. -
$GCP_SERVICE_ACCOUNT: (Optional) The GCP service account JSON.
You need:
- A running Redis instance on an Google Cloud Memorystore cluster
- Assign the principal to the corresponding role:
-
Cloud Memorystore Redis DB Connection User(
roles/redis.dbConnectionUser) for Memorystore for Redis Cluster -
Memorystore DB Connector User (
roles/memorystore.dbConnectionUser) for Memorystore for Valkey
-
Cloud Memorystore Redis DB Connection User(
To configure cloud authentication with Redis, add the following parameters to your plugin configuration:
config:
vectordb:
strategy: redis
redis:
cluster_nodes:
- ip: $CLUSTER_ADDRESS
port: 6379
port: 6379
cloud_authentication:
auth_provider: gcp
gcp_service_account_json: $GCP_SERVICE_ACCOUNTReplace the following with your actual values:
-
$CLUSTER_ADDRESS: The Memorystore cluster address. -
$GCP_SERVICE_ACCOUNT: The GCP service account JSON.
FAQs
How do I resolve the MemoryDB error Number of indexes exceeds the limit?
If you see the following error in the logs:
failed to create memorydb instance failed to create index: LIMIT Number of indexes (11) exceeds the limit (10)This means that the hardcoded MemoryDB instance limit has been reached. To resolve this, create more MemoryDB instances to handle multiple AI Semantic Cache plugin instances.
