AI Semantic Cache

AI License Required

What is semantic caching?

Semantic caching improves retrieval efficiency by matching queries on meaning and context rather than on exact text. It indexes each cached response by the intent of the request that produced it, so the response can be retrieved when a semantically similar request arrives.

When a new request is made, the system retrieves and reuses a previously cached response if it is contextually relevant, even if the phrasing is different. This reduces redundant processing, speeds up response times, and keeps answers relevant to the user’s intent.

For example, if a user asks, “how to integrate our API with a mobile app” and later asks, “what are the steps for connecting our API to a smartphone application?”, the system understands that both questions are asking for the same information. It can then retrieve and reuse previously cached responses, even if the wording is different. This approach reduces processing time and speeds up responses.
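The matching idea in this example can be sketched with cosine similarity over embedding vectors. This is only an illustration: the three-dimensional vectors and the 0.95 threshold below are made-up values, not output from a real embeddings model or a plugin default.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings. Real models emit hundreds or thousands of dimensions.
EMBEDDINGS = {
    "how to integrate our API with a mobile app": [0.90, 0.10, 0.20],
    "what are the steps for connecting our API to a smartphone application?": [0.85, 0.15, 0.25],
    "what is the refund policy?": [0.10, 0.90, 0.30],
}

def is_semantic_match(query_a, query_b, threshold=0.95):
    """Treat two queries as equivalent when their vectors point the same way."""
    score = cosine_similarity(EMBEDDINGS[query_a], EMBEDDINGS[query_b])
    return score >= threshold
```

With these toy vectors, the two API questions score well above the threshold, while the unrelated refund question scores far below it.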

The AI Semantic Cache plugin may not be ideal if the following are true:

- You have limited hardware or budget. Storing semantic vectors and running similarity searches require significant storage and computing power.
- Your data doesn’t rely on semantics, or exact matches work fine. In this case, semantic caching may offer little benefit, and traditional or keyword-based caching might be more efficient.

How it works

Semantic caching with the AI Semantic Cache plugin involves three parts: request handling, embedding generation, and response caching.

First, a user starts a chat request with the LLM. The AI Semantic Cache plugin queries the vector database to see if there are any semantically similar requests that have already been cached. If there is a match, the vector database returns the cached response to the user.

 
sequenceDiagram
 actor User
 participant Kong Gateway/AI Semantic Cache plugin
 participant Vector database

 User->>Kong Gateway/AI Semantic Cache plugin: LLM chat request
 Kong Gateway/AI Semantic Cache plugin->>Vector database: Query for semantically similar previous requests
 Vector database-->>User: If response, return it or stream it back

If there isn’t a match, the AI Semantic Cache plugin prompts the embeddings LLM to generate an embedding for the response.

 
sequenceDiagram
 participant Kong Gateway/AI Semantic Cache plugin
 participant Embeddings LLM

 Kong Gateway/AI Semantic Cache plugin->>Embeddings LLM: Generate embeddings for `config.message_countback` messages
 Embeddings LLM-->>Kong Gateway/AI Semantic Cache plugin: Return embeddings

The AI Semantic Cache plugin uses a vector database and cache to store responses to requests. The plugin can then retrieve a cached response if a new request matches the semantics of a previous request, or it can tell the vector database to store a new response if there are no matches.

 
sequenceDiagram
 participant Kong Gateway/AI Semantic Cache plugin
 participant Prompt/Chat LLM
 participant Vector database
 actor User

 Kong Gateway/AI Semantic Cache plugin->>Prompt/Chat LLM: Make LLM request
 Prompt/Chat LLM-->>Kong Gateway/AI Semantic Cache plugin: Receive response
 Kong Gateway/AI Semantic Cache plugin->>Vector database: Store vectors
 Kong Gateway/AI Semantic Cache plugin->>Vector database: Store response message options
 Kong Gateway/AI Semantic Cache plugin-->>User: Return realtime response
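
The three stages above can be sketched as a small in-memory model. This is a simplified illustration rather than the plugin's implementation: the class name, the `embed` and `call_llm` callables, and the 0.9 threshold are hypothetical, and a plain Python list stands in for the vector database. The sketch computes the embedding before searching, because an in-memory similarity search needs a vector to compare against.

```python
import math

class SemanticCacheSketch:
    """Toy model of the lookup, embed, and store flow."""

    def __init__(self, embed, call_llm, threshold=0.9):
        self.embed = embed          # hypothetical callable: prompt -> list of floats
        self.call_llm = call_llm    # hypothetical callable: prompt -> response text
        self.threshold = threshold  # illustrative similarity cutoff
        self.entries = []           # (vector, response) pairs; stands in for the vector database

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def handle(self, prompt):
        """Return (response, cache_status) for a prompt."""
        vector = self.embed(prompt)
        best_score, best_response = 0.0, None
        for stored_vector, stored_response in self.entries:
            score = self._cosine(vector, stored_vector)
            if score > best_score:
                best_score, best_response = score, stored_response
        if best_response is not None and best_score >= self.threshold:
            return best_response, "Hit"
        # No similar request is cached: call the LLM, then store vector and response.
        response = self.call_llm(prompt)
        self.entries.append((vector, response))
        return response, "Miss"
```

A second request whose vector is close to a stored one returns the stored response without calling the LLM again.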

Cache management

With the AI Semantic Cache plugin, you can configure a cache of your choice to store the responses from the LLM.

The AI Semantic Cache plugin supports Redis as a cache.

Caching mechanisms

The AI Semantic Cache plugin improves how AI systems provide responses by using two kinds of caching mechanisms:

- Exact Caching: This stores precise, unaltered responses for specific queries. If a user asks the same question multiple times, the system can quickly retrieve the pre-stored response rather than generating it again each time. This speeds up response times and reduces computational load.
- Semantic Caching: This approach is more flexible and involves storing responses based on the meaning or intent behind the queries. Instead of relying on exact matches, the system can understand and reuse information that is conceptually similar. For instance, if a user asks about “Italian restaurants in New York City” and later about “New York City Italian cuisine,” semantic caching can help provide relevant information based on their related meanings.

Together, these caching methods enhance the efficiency and relevance of AI responses, making interactions faster and more contextually accurate.

When Exact Caching is enabled, the AI Semantic Cache plugin may still return results for queries that are similar but not identical. This is expected behavior: the plugin performs similarity-based caching regardless of the Exact Caching setting.
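
To illustrate why exact caching on its own misses rephrased queries, here is a minimal sketch of an exact-match cache. The SHA-256 key derivation and the function names are assumptions made for illustration and do not describe how the plugin builds its cache keys.

```python
import hashlib

# In-memory store standing in for the configured cache.
exact_cache = {}

def exact_cache_key(prompt):
    """Derive a key from the exact prompt text (hypothetical scheme)."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def exact_store(prompt, response):
    exact_cache[exact_cache_key(prompt)] = response

def exact_lookup(prompt):
    """Return the cached response, or None when the text differs in any way."""
    return exact_cache.get(exact_cache_key(prompt))
```

Storing a response for “Italian restaurants in New York City” and then looking up “New York City Italian cuisine” returns `None`, because the two strings hash to different keys. Semantic caching covers that gap.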

Headers sent to the client

When the AI Semantic Cache plugin is active, Kong Gateway sends additional headers indicating the cache status and other relevant information:

X-Cache-Status: Hit
X-Cache-Status: Miss
X-Cache-Status: Bypass
X-Cache-Status: Refresh
X-Cache-Key: <cache_key>
X-Cache-Ttl: <ttl>
Age: <age>

These headers help clients understand whether a response was served from the cache, if the cache key was used, the remaining time-to-live, and the age of the cached response.
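
A client could read these headers along the following lines. The function name is hypothetical, and the sketch assumes that `X-Cache-Ttl` and `Age` carry integer seconds and that only a `Hit` status means the response came from the cache.

```python
def summarize_cache_headers(headers):
    """Summarize cache-related response headers from a dict of header values."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    status = normalized.get("x-cache-status")
    ttl = normalized.get("x-cache-ttl")
    age = normalized.get("age")
    return {
        "status": status,
        "served_from_cache": status == "Hit",  # assumption: only Hit counts
        "cache_key": normalized.get("x-cache-key"),
        "ttl_seconds": int(ttl) if ttl is not None else None,  # assumed to be seconds
        "age_seconds": int(age) if age is not None else None,
    }
```

For example, passing `{"X-Cache-Status": "Miss"}` yields a summary with `served_from_cache` set to `False` and the remaining fields set to `None`.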

Cache control headers

The plugin respects cache control headers to determine if requests and responses should be cached or not. It supports the following directives:

- no-store: Prevents caching of the request or response
- no-cache: Forces validation with the origin server before serving the cached response
- private: Ensures the response is not cached by shared caches
- max-age and s-maxage: Set the maximum age of the cached response. After expiration, the vector database deletes the cached response message, so it is no longer served.

Because most AI services send no-cache in their response headers, setting cache_control to true results in a cache bypass for those services. Only consider enabling cache_control if you use self-hosted services and control the response Cache-Control headers.
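
The directive handling described above can be sketched as a small parser. This is a simplified model, not the plugin's logic: it treats `no-cache` as a bypass, matching the behavior described for AI services, and it assumes `s-maxage` takes precedence over `max-age` because a gateway acts as a shared cache. Both function names are hypothetical.

```python
def parse_cache_control(header_value):
    """Parse a Cache-Control header value into a {directive: value or None} dict."""
    directives = {}
    for part in header_value.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip()] = value.strip().strip('"') or None
    return directives

def cache_decision(header_value):
    """Decide whether a response may be cached and for how many seconds."""
    directives = parse_cache_control(header_value)
    # Simplification: any of these directives leads to a cache bypass.
    if any(name in directives for name in ("no-store", "no-cache", "private")):
        return {"cacheable": False, "ttl": None}
    ttl = directives.get("s-maxage") or directives.get("max-age")
    return {"cacheable": True, "ttl": int(ttl) if ttl is not None else None}
```

For a header such as `max-age=300, s-maxage=60`, this sketch returns a TTL of 60 seconds. A header containing only `no-cache` is reported as not cacheable.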

Partials v3.13+

This plugin supports vectordb and embeddings Partials, which let you define shared vector database and embeddings configuration once and reuse it across multiple AI Gateway plugins. This is useful when running this plugin alongside others that use the same vector database and embeddings model, such as AI RAG Injector, AI Semantic Prompt Guard, AI Semantic Response Guard, and AI Proxy Advanced.

Partial type	Fields covered
`vectordb`	`config.vectordb`
`embeddings`	`config.embeddings`

For setup instructions, see AI plugin Partials.

Vector databases

A vector database stores vector embeddings, which are numerical representations of data items. For example, a response is converted to a numerical representation and stored in the vector database so that the plugin can compare new requests against the stored vectors and find relevant cached items.

The AI Semantic Cache plugin supports the following vector databases:

Using config.vectordb.strategy: redis and parameters in config.vectordb.redis:
- Redis with Vector Similarity Search (VSS)
- Redis Cloud
- Valkey v3.14+: When you configure vectordb.strategy: redis, Kong Gateway queries the server and checks the server name field. If it detects Valkey, it automatically uses the Valkey-specific driver.
- Managed Redis with cloud authentication:
  - AWS ElastiCache (auth_provider: aws)
  - Azure Managed Redis (auth_provider: azure)
  - Google Cloud Memorystore (auth_provider: gcp)
  For configuration details, see Using cloud authentication with Redis.
Using config.vectordb.strategy: pgvector and parameters in config.vectordb.pgvector:
- PostgreSQL with pgvector v3.10+

To learn more about vector databases in AI Gateway, see Embedding-based similarity matching in Kong AI gateway plugins.

Using cloud authentication with Redis v3.13+

If your plugin uses a Redis datastore, you can authenticate to it with a cloud Redis provider. This allows you to seamlessly rotate credentials without relying on static passwords.

The following providers are supported:

AWS ElastiCache
Azure Managed Redis
Google Cloud Memorystore (with or without Valkey)

For an AWS ElastiCache instance, you need:

- A running Redis instance on an AWS ElastiCache instance, either ElastiCache for Valkey 7.2 or later or ElastiCache for Redis OSS 7.0 or later
- An ElastiCache user with “Authentication mode” set to “IAM”
- The following policy assigned to the IAM user or IAM role that is used to connect to ElastiCache:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticache:Connect"
      ],
      "Resource": [
        "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE",
        "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE_USER"
      ]
    }
  ]
}

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

config:
  vectordb:
    strategy: redis
    redis:
      host: $INSTANCE_ADDRESS
      username: $INSTANCE_USERNAME
      port: 6379
      cloud_authentication:
        auth_provider: aws
        aws_cache_name: $AWS_CACHE_NAME
        aws_is_serverless: false
        aws_region: $AWS_REGION
        aws_access_key_id: $AWS_ACCESS_KEY_ID
        aws_secret_access_key: $AWS_ACCESS_SECRET_KEY

Replace the following with your actual values:

$INSTANCE_ADDRESS: The ElastiCache instance address.
$INSTANCE_USERNAME: The ElastiCache username with IAM Auth mode configured.
$AWS_CACHE_NAME: Name of your AWS ElastiCache instance.
$AWS_REGION: Your AWS ElastiCache instance region.
$AWS_ACCESS_KEY_ID: (Optional) Your AWS access key ID.
$AWS_ACCESS_SECRET_KEY: (Optional) Your AWS secret access key.
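
If you generate plugin configuration from a script, the placeholders above can be filled from environment variables. The following sketch is one possible approach, not an official tool: the function name is hypothetical, the environment variable names simply mirror the placeholders, and the nesting of `cloud_authentication` under `redis` is an assumption about the configuration layout.

```python
import json
import os

def build_aws_redis_config(env=None):
    """Build the AWS ElastiCache instance configuration as a nested dict."""
    env = os.environ if env is None else env
    cloud_auth = {
        "auth_provider": "aws",
        "aws_cache_name": env["AWS_CACHE_NAME"],
        "aws_is_serverless": False,
        "aws_region": env["AWS_REGION"],
    }
    # The access key pair is optional, so include it only when both values are set.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_ACCESS_SECRET_KEY"):
        cloud_auth["aws_access_key_id"] = env["AWS_ACCESS_KEY_ID"]
        cloud_auth["aws_secret_access_key"] = env["AWS_ACCESS_SECRET_KEY"]
    return {
        "config": {
            "vectordb": {
                "strategy": "redis",
                "redis": {
                    "host": env["INSTANCE_ADDRESS"],
                    "username": env["INSTANCE_USERNAME"],
                    "port": 6379,
                    "cloud_authentication": cloud_auth,  # assumed nesting
                },
            }
        }
    }

def render_config(env=None):
    """Serialize the configuration as indented JSON for review or templating."""
    return json.dumps(build_aws_redis_config(env), indent=2)
```

Leaving out the optional key pair keeps static credentials out of the rendered configuration.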

For an AWS ElastiCache cluster, you need:

- A running Redis instance on an AWS ElastiCache cluster, either ElastiCache for Valkey 7.2 or later or ElastiCache for Redis OSS 7.0 or later
- An ElastiCache user with “Authentication mode” set to “IAM”
- The following policy assigned to the IAM user or IAM role that is used to connect to ElastiCache:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticache:Connect"
      ],
      "Resource": [
        "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE",
        "arn:aws:elasticache:ARN_OF_THE_ELASTICACHE_USER"
      ]
    }
  ]
}

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

config:
  vectordb:
    strategy: redis
    redis:
      cluster_nodes:
      - ip: $CLUSTER_ADDRESS
        port: 6379
      username: $CLUSTER_USERNAME
      port: 6379
      cloud_authentication:
        auth_provider: aws
        aws_cache_name: $AWS_CACHE_NAME
        aws_is_serverless: false
        aws_region: $AWS_REGION
        aws_access_key_id: $AWS_ACCESS_KEY_ID
        aws_secret_access_key: $AWS_ACCESS_SECRET_KEY

Replace the following with your actual values:

$CLUSTER_ADDRESS: The ElastiCache cluster address.
$CLUSTER_USERNAME: The ElastiCache username with IAM Auth mode configured.
$AWS_CACHE_NAME: Name of your AWS ElastiCache cluster.
$AWS_REGION: Your AWS ElastiCache cluster region.
$AWS_ACCESS_KEY_ID: (Optional) Your AWS access key ID.
$AWS_ACCESS_SECRET_KEY: (Optional) Your AWS secret access key.

For an Azure Managed Redis instance, you need:

- A running Redis instance on an Azure Managed Redis instance with Entra authentication configured
- The user, service principal, or identity added to the “Microsoft Entra Authentication Redis user” list for the Azure Managed Redis instance

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

config:
  vectordb:
    strategy: redis
    redis:
      host: $INSTANCE_ADDRESS
      username: $INSTANCE_USERNAME
      port: 10000
      cloud_authentication:
        auth_provider: azure
        azure_client_id: $AZURE_CLIENT_ID
        azure_client_secret: $AZURE_CLIENT_SECRET
        azure_tenant_id: $AZURE_TENANT_ID

Replace the following with your actual values:

$INSTANCE_ADDRESS: The Azure Managed Redis instance address.
$INSTANCE_USERNAME: The object (principal) ID of the Principal/Identity with essential access.
$AZURE_CLIENT_ID: The client ID of the Principal/Identity.
$AZURE_CLIENT_SECRET: (Optional) The client secret of the Principal/Identity.
$AZURE_TENANT_ID: (Optional) The tenant ID of the Principal/Identity.

For an Azure Managed Redis cluster, you need:

- A running Redis instance on an Azure Managed Redis cluster with Entra authentication configured
- The user, service principal, or identity added to the “Microsoft Entra Authentication Redis user” list for the Azure Managed Redis instance

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

config:
  vectordb:
    strategy: redis
    redis:
      cluster_nodes:
      - ip: $CLUSTER_ADDRESS
        port: 10000
      username: $CLUSTER_USERNAME
      port: 10000
      cloud_authentication:
        auth_provider: azure
        azure_client_id: $AZURE_CLIENT_ID
        azure_client_secret: $AZURE_CLIENT_SECRET
        azure_tenant_id: $AZURE_TENANT_ID

Replace the following with your actual values:

$CLUSTER_ADDRESS: The Azure Managed Redis cluster address.
$CLUSTER_USERNAME: The object (principal) ID of the Principal/Identity with essential access.
$AZURE_CLIENT_ID: The client ID of the Principal/Identity.
$AZURE_CLIENT_SECRET: (Optional) The client secret of the Principal/Identity.
$AZURE_TENANT_ID: (Optional) The tenant ID of the Principal/Identity.

For a Google Cloud Memorystore instance, you need:

- A running Redis instance on a Google Cloud Memorystore instance
- The principal assigned to the corresponding role:
  - Cloud Memorystore Redis DB Connection User (roles/redis.dbConnectionUser) for Memorystore for Redis Cluster
  - Memorystore DB Connector User (roles/memorystore.dbConnectionUser) for Memorystore for Valkey

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

config:
  vectordb:
    strategy: redis
    redis:
      host: $INSTANCE_ADDRESS
      port: 6379
      cloud_authentication:
        auth_provider: gcp
        gcp_service_account_json: $GCP_SERVICE_ACCOUNT

Replace the following with your actual values:

$INSTANCE_ADDRESS: The Memorystore instance address.
$GCP_SERVICE_ACCOUNT: (Optional) The GCP service account JSON.

For a Google Cloud Memorystore cluster, you need:

- A running Redis instance on a Google Cloud Memorystore cluster
- The principal assigned to the corresponding role:
  - Cloud Memorystore Redis DB Connection User (roles/redis.dbConnectionUser) for Memorystore for Redis Cluster
  - Memorystore DB Connector User (roles/memorystore.dbConnectionUser) for Memorystore for Valkey

To configure cloud authentication with Redis, add the following parameters to your plugin configuration:

config:
  vectordb:
    strategy: redis
    redis:
      cluster_nodes:
      - ip: $CLUSTER_ADDRESS
        port: 6379
      port: 6379
      cloud_authentication:
        auth_provider: gcp
        gcp_service_account_json: $GCP_SERVICE_ACCOUNT

Replace the following with your actual values:

$CLUSTER_ADDRESS: The Memorystore cluster address.
$GCP_SERVICE_ACCOUNT: The GCP service account JSON.

FAQs

How do I resolve the MemoryDB error Number of indexes exceeds the limit?

If you see the following error in the logs:

failed to create memorydb instance failed to create index: LIMIT Number of indexes (11) exceeds the limit (10)

This means that the hardcoded MemoryDB instance limit has been reached. To resolve this, create more MemoryDB instances to handle multiple AI Semantic Cache plugin instances.
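
As a rough planning aid, you can estimate how many MemoryDB instances a deployment needs. The sketch below assumes that each AI Semantic Cache plugin instance creates exactly one index and takes the limit of 10 from the error message above. The function name and both assumptions are illustrative only.

```python
def plan_memorydb_instances(plugin_instances, index_limit=10):
    """Split plugin instance names into groups that each fit one MemoryDB instance."""
    if index_limit < 1:
        raise ValueError("index_limit must be at least 1")
    return [
        plugin_instances[start:start + index_limit]
        for start in range(0, len(plugin_instances), index_limit)
    ]
```

Under these assumptions, 11 plugin instances need two MemoryDB instances: one holding 10 indexes and one holding the remaining index.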

URL: https://developer.konghq.com/plugins/ai-semantic-cache/