AI Gateway Enterprise: This plugin is only available as part of our AI Gateway Enterprise offering.
The Kong AI Prompt Compressor plugin compresses retrieved chunks before sending them to a Large Language Model (LLM), reducing text length while preserving meaning. It uses the LLMLingua 2 library for fast, high-quality compression. The plugin supports:
- Ratio-based or target token compression — for example, reduce a message to 80% of the original length or compress to 150 tokens.
- Configurable compression ranges — for example, compress prompts under 100 tokens with a 0.8 ratio or compress them to exactly 100 tokens.
- Selective compression using <LLMLINGUA>...</LLMLINGUA> tags to target specific sections of the prompt. These tags work only in the inject_template field of the AI RAG Injector plugin and must be used in combination with the AI Prompt Compressor.
Why use prompt compression
Efficient prompt compression helps you manage token limits, cut costs, and speed up LLM requests — all while keeping sensitive data safe and your prompts focused.
The table below outlines common use cases for the plugin and the configuration options available to tailor its behavior.
| Use case | Description |
|---|---|
| Token limit management | Compress verbose inputs like chat history or documents to stay within the LLM’s context window. Prevents truncation of important content. |
| Cost reduction | Reducing token count in prompts decreases API costs when calling large language models, especially for high-volume use cases. |
| Latency reduction | Smaller prompts result in faster request/response cycles, improving performance for real-time applications like voice assistants. |
| Data privacy | Compress or abstract sensitive or personally identifiable information to maintain privacy and comply with data protection standards. |
| Dynamic prompt optimization | Automatically strip verbose or low-value content before sending to the LLM, keeping the focus on what’s most relevant. |
AI Prompt Compression Service
Kong provides a Docker image for the AI Prompt Compressor service, which compresses LLM prompts before sending them upstream. It uses LLMLingua 2 to reduce prompt size, which helps you manage token limits and maintain context fidelity. The service supports both HTTP and JSON-RPC APIs and is designed to work with the AI Prompt Compressor plugin in AI Gateway.
Kong provides the Compressor service as a private Docker image in a Cloudsmith repository. Contact Kong Support to get access to it.
Once you’ve received your Cloudsmith access token, run the following commands in Docker to pull the image:
- To pull images, first authenticate with the token provided by Kong Support:

      docker login docker.cloudsmith.io

- Docker then prompts you for a username and password:

      Username: kong/ai-compress
      Password: <YOUR_TOKEN>

  This is a token-based login with read-only access. You can pull images but not push them. Contact Kong Support for your token.

- To pull an image, replace <image-name> and <tag> with the appropriate image and version, such as:

      docker pull docker.cloudsmith.io/kong/ai-compress/service:v0.0.3

- You can now run the image with the following command:

      docker run --rm -p 8080:8080 docker.cloudsmith.io/kong/ai-compress/service:v0.0.3
Image configuration options
You can configure the Kong Compressor Service using environment variables. These affect model selection, hardware usage, logging, and worker behavior.
| Configuration option | Description |
|---|---|
| LLMLINGUA_MODEL_NAME | Specifies the LLMLingua 2 model to use for compression. Defaults to microsoft/llmlingua-2-xlm-roberta-large-meetingbank. |
| LLMLINGUA_DEVICE_MAP | Device on which to run the model. Supported values include cpu, cuda, auto, or mps. |
| LLMLINGUA_LOG_LEVEL | Log level for the LLMLingua compression logic. Set to info, debug, or warning based on your needs. |
| GUNICORN_WORKERS | Number of Gunicorn worker processes (for Docker deployments only). Defaults to 2. |
| GUNICORN_LOG_LEVEL | Log level for Gunicorn server output (for Docker deployments only). Defaults to info. |
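As a rough illustration of how these variables could be passed to the container, the sketch below assembles a `docker run` command with one `-e` flag per variable. The helper name `build_docker_run` and the chosen values (including `GUNICORN_WORKERS=4`) are illustrative assumptions, not recommended settings.

```python
import shlex

# Image reference taken from the pull/run commands earlier in this page.
IMAGE = "docker.cloudsmith.io/kong/ai-compress/service:v0.0.3"


def build_docker_run(env, port=8080):
    """Assemble a `docker run` argument list with one -e flag per variable.

    `build_docker_run` is a hypothetical helper written for this example.
    """
    cmd = ["docker", "run", "--rm", "-p", f"{port}:{port}"]
    for name, value in env.items():
        cmd += ["-e", f"{name}={value}"]
    cmd.append(IMAGE)
    return cmd


# Example values only; pick the ones that match your hardware and needs.
settings = {
    "LLMLINGUA_DEVICE_MAP": "cpu",
    "LLMLINGUA_LOG_LEVEL": "debug",
    "GUNICORN_WORKERS": "4",
}

command = build_docker_run(settings)
# Print a shell-safe version of the command instead of executing it.
print(shlex.join(command))
```

Printing the command rather than running it keeps the example side-effect free; you can paste the output into a terminal once you have authenticated with the registry.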
Compression endpoints
The compressor service exposes both REST and JSON-RPC endpoints. You can use these interfaces to compress prompts, check the current status, or integrate with upstream services and plugins.
- POST /llm/v1/compressPrompt: Compresses a prompt using either a compression ratio or a target token count. Supports selective compression via <LLMLINGUA> tags.
- GET /status: Returns information about the currently loaded LLMLingua model and device settings (for example, CPU or GPU).
- POST /: JSON-RPC endpoint that supports the llm.v1.compressPrompt method. Use this to invoke compression programmatically over JSON-RPC.
See the AI Prompt Compressor OpenAPI specification for complete details.
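The following sketch shows how a client might build requests for the REST and JSON-RPC endpoints listed above. It assumes the service is reachable at `http://localhost:8080` (the port mapping from the `docker run` command) and that the JSON-RPC endpoint uses JSON-RPC 2.0 framing. The parameter names `prompt` and `ratio` are hypothetical placeholders; the OpenAPI specification is the authority on the real request schema.

```python
import json
import urllib.request

# Assumption: the service listens locally on the port mapped by `docker run`.
BASE_URL = "http://localhost:8080"


def _post(url, payload):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url=url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rest_request(params):
    """Request for the REST compression endpoint."""
    return _post(f"{BASE_URL}/llm/v1/compressPrompt", params)


def jsonrpc_request(params, request_id=1):
    """Request for the JSON-RPC endpoint, assuming a JSON-RPC 2.0 envelope."""
    envelope = {
        "jsonrpc": "2.0",
        "method": "llm.v1.compressPrompt",
        "params": params,
        "id": request_id,
    }
    return _post(f"{BASE_URL}/", envelope)


# Hypothetical parameter names; consult the OpenAPI specification.
params = {"prompt": "Some long retrieved context to compress.", "ratio": 0.8}

rest = rest_request(params)
rpc = jsonrpc_request(params)

# To actually send a request against a running service:
#   with urllib.request.urlopen(rest) as resp:
#       print(resp.read().decode("utf-8"))
print(rest.get_method(), rest.full_url)
print(rpc.get_method(), rpc.full_url)
```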
Prompt compression options
The AI Prompt Compressor plugin offers flexible compression controls to fit different use cases. You can choose between full-prompt compression, conditional strategies, or selectively compressing only parts of the prompt:
| Configuration option | Description |
|---|---|
| Compression by ratio | Compress the prompt to a percentage of its original length (for example, reduce to 80%). This allows for consistent shrinkage regardless of the initial size. |
| Compression by token count | Compress the prompt to a specific token target (for example, 150 tokens). Useful when working close to LLM context window limits. |
| Conditional rules | Apply different compression strategies based on prompt length. For example, compress prompts under 100 tokens using a 0.8 ratio, and compress longer prompts to a fixed token count. |
| Selective compression with tags | Wrap sections of the prompt in <LLMLINGUA>...</LLMLINGUA> to target only specific parts for compression, preserving untagged content as-is. |
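To make the conditional rules concrete, here is a minimal sketch of how a length-based rule table could be evaluated. The field names (`min_tokens`, `max_tokens`, `type`, `value`) and the function `plan_compression` are assumptions made for illustration and are not the plugin's actual configuration schema.

```python
def plan_compression(token_count, ranges):
    """Return (mode, target_tokens) for the first matching range, else None.

    Hypothetical helper: a range matches when
    min_tokens <= token_count < max_tokens.
    """
    for rule in ranges:
        if rule["min_tokens"] <= token_count < rule["max_tokens"]:
            if rule["type"] == "ratio":
                # Ratio mode: keep roughly `value` of the original tokens.
                return ("ratio", round(token_count * rule["value"]))
            # Target mode: compress to a fixed token count.
            return ("target_token", int(rule["value"]))
    return None  # No rule applies; the prompt would be left unchanged.


# Illustrative rules: prompts under 100 tokens use a 0.8 ratio,
# longer prompts are compressed to a fixed 100 tokens.
ranges = [
    {"min_tokens": 0, "max_tokens": 100, "type": "ratio", "value": 0.8},
    {"min_tokens": 100, "max_tokens": 100000, "type": "target_token", "value": 100},
]

print(plan_compression(80, ranges))   # 80 tokens at ratio 0.8
print(plan_compression(500, ranges))  # long prompt, fixed target
```

With these rules, an 80-token prompt is planned for about 64 tokens, while a 500-token prompt is planned for a fixed 100 tokens.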
How it works
- The user sends the final prompt to the AI Prompt Compressor plugin.
- The plugin checks the prompt for <LLMLINGUA>…</LLMLINGUA> tags.
  - If tags are found, only the tagged sections are sent to LLMLingua 2 for compression.
  - If no tags are found, the entire prompt is sent to LLMLingua 2 for compression.
- Compression is applied based on configured rules—by ratio, target token count, or conditional length-based rules.
- The compressed prompt is returned to the plugin.
- The plugin sends the compressed prompt to the Large Language Model (LLM).
- The LLM processes the prompt and returns the response to the user.
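The tag-handling steps above can be sketched as follows. This is not the plugin's implementation: `stand_in_compressor` is a toy placeholder that drops a few filler words in place of LLMLingua 2, and removing the tags from the output is an assumption made for the example.

```python
import re

# Matches tagged sections; DOTALL lets a section span multiple lines.
TAG_PATTERN = re.compile(r"<LLMLINGUA>(.*?)</LLMLINGUA>", re.DOTALL)

FILLER = {"the", "a", "an", "of", "very", "really"}


def stand_in_compressor(text):
    """Toy placeholder for LLMLingua 2: drops filler words. NOT the real algorithm."""
    return " ".join(w for w in text.split() if w.lower() not in FILLER)


def compress_prompt(prompt, compress=stand_in_compressor):
    """Compress only tagged sections if any exist, else the whole prompt."""
    if TAG_PATTERN.search(prompt):
        # Replace each tagged section with its compressed form;
        # untagged text is kept verbatim.
        return TAG_PATTERN.sub(lambda m: compress(m.group(1)), prompt)
    return compress(prompt)


tagged = (
    "Answer the question. "
    "<LLMLINGUA>The capital of France is a very large city.</LLMLINGUA>"
)
print(compress_prompt(tagged))
print(compress_prompt("The cat sat on a mat"))
```

In the tagged example, the instruction "Answer the question." passes through untouched while only the wrapped context is shortened, which mirrors the selective behavior described in the list.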
The diagram below illustrates how the AI Prompt Compressor plugin processes and compresses incoming prompts based on tagging and configured rules.
sequenceDiagram
    actor User
    participant KongAICompressor as AI Prompt Compressor Plugin
    participant LLMLingua2 as LLMLingua 2 Compressor
    participant LLM as Large Language Model
    User->>KongAICompressor: Sends final prompt
    activate KongAICompressor
    KongAICompressor->>KongAICompressor: Check for LLMLINGUA tags
    alt If tagged content found
        KongAICompressor->>LLMLingua2: Compress tagged sections
        activate LLMLingua2
        LLMLingua2-->>KongAICompressor: Return compressed sections
        deactivate LLMLingua2
    else If no LLMLINGUA tags
        KongAICompressor->>LLMLingua2: Compress entire prompt
        activate LLMLingua2
        LLMLingua2-->>KongAICompressor: Return compressed prompt
        deactivate LLMLingua2
    end
    KongAICompressor->>LLM: Send compressed prompt
    deactivate KongAICompressor
    activate LLM
    LLM-->>User: Return response
    deactivate LLM
The AI Prompt Compressor plugin applies structured compression that preserves the essential context of user prompts, rather than trimming prompts arbitrarily or risking token overflows. This ensures the LLM receives a well-formed, focused prompt while keeping token usage under control.
