AI Prompt Compressor

AI License Required

Why use prompt compression

Efficient prompt compression helps you manage token limits, cut costs, and speed up LLM requests — all while keeping sensitive data safe and your prompts focused.

The table below outlines common use cases for the plugin and the configuration options available to tailor its behavior.

Use case	Description
Token limit management	Compress verbose inputs like chat history or documents to stay within the LLM’s context window. Prevents truncation of important content.
Cost reduction	Reducing token count in prompts decreases API costs when calling large language models, especially for high-volume use cases.
Latency reduction	Smaller prompts result in faster request/response cycles, improving performance for real-time applications like voice assistants.
Data privacy	Compress or abstract sensitive or personally identifiable information to maintain privacy and comply with data protection standards.
Dynamic prompt optimization	Automatically strip verbose or low-value content before sending to the LLM, keeping the focus on what’s most relevant.

Kong provides a Docker image for the AI Prompt Compressor service, which compresses LLM prompts before sending them upstream. It uses LLMLingua 2 to reduce prompt size, which helps you manage token limits and maintain context fidelity. The service supports both HTTP and JSON-RPC APIs and is designed to work with the AI Prompt Compressor plugin in AI Gateway.

Kong provides the Compressor service as a private Docker image in a Cloudsmith repository. Contact Kong Support to get access to it.

Once you’ve received your Cloudsmith access token from Kong Support, authenticate with the registry before pulling images:
```
 docker login docker.cloudsmith.io
```
Docker then prompts you for a username and password:
```
 Username: kong/ai-compress
 Password: <YOUR_TOKEN>
```
This is a token-based login with read-only access: you can pull images but not push them. Contact Kong Support if you don't have a token.

To pull an image, specify the image name and version tag. For example:
```
 docker pull docker.cloudsmith.io/kong/ai-compress/service:v0.0.3
```

You can now run the image with the following command, which publishes the service on port 8080:

```
 docker run --rm -p 8080:8080 docker.cloudsmith.io/kong/ai-compress/service:v0.0.3
```


Image configuration options

You can configure the Kong Compressor Service using environment variables. These affect model selection, hardware usage, logging, and worker behavior.

Configuration option	Description
LLMLINGUA_MODEL_NAME	Specifies the LLMLingua 2 model to use for compression. Defaults to `microsoft/llmlingua-2-xlm-roberta-large-meetingbank`.
LLMLINGUA_DEVICE_MAP	Device on which to run the model. Supported values include `cpu`, `cuda`, `auto`, and `mps`.
LLMLINGUA_LOG_LEVEL	Log level for the LLMLingua compression logic. Set to `info`, `debug`, or `warning` based on your needs.
GUNICORN_WORKERS	Number of Gunicorn worker processes (for Docker deployments only). Defaults to `2`.
GUNICORN_LOG_LEVEL	Log level for Gunicorn server output (for Docker deployments only). Defaults to `info`.
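
As an illustration of how these variables might be passed to the container, the following Python sketch assembles a `docker run` command that sets them with Docker's standard `-e` flag. The helper name `build_run_command` and the example values are hypothetical. The image reference and container port 8080 are taken from the run command shown earlier.

```python
from typing import Dict, List, Optional

# Image reference copied from the pull/run commands in this document.
IMAGE = "docker.cloudsmith.io/kong/ai-compress/service:v0.0.3"


def build_run_command(env: Optional[Dict[str, str]] = None,
                      host_port: int = 8080) -> List[str]:
    """Build a `docker run` argument list (hypothetical helper).

    Each environment variable is passed with Docker's `-e NAME=value` flag.
    The container port is assumed to be 8080, as in the run command above.
    """
    cmd = ["docker", "run", "--rm", "-p", f"{host_port}:8080"]
    for name, value in sorted((env or {}).items()):
        cmd += ["-e", f"{name}={value}"]
    cmd.append(IMAGE)
    return cmd


# Example values only: run on CPU with verbose compression logs.
command = build_run_command({
    "LLMLINGUA_DEVICE_MAP": "cpu",
    "LLMLINGUA_LOG_LEVEL": "debug",
    "GUNICORN_WORKERS": "4",
})
print(" ".join(command))
```

Building the command as a list rather than a single string makes it safe to hand to `subprocess.run` without shell quoting concerns.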

Compression endpoints

The compressor service exposes both REST and JSON-RPC endpoints. You can use these interfaces to compress prompts, check the current status, or integrate with upstream services and plugins.

POST /llm/v1/compressPrompt: Compresses a prompt using either a compression ratio or a target token count. Supports selective compression via <LLMLINGUA> tags.
GET /status: Returns information about the currently loaded LLMLingua model and device settings (for example, CPU or GPU).
POST /: JSON-RPC endpoint that supports the llm.v1.compressPrompt method. Use this to invoke compression programmatically over JSON-RPC.

See the AI Prompt Compressor OpenAPI specification for complete details.
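
To give a rough idea of what a JSON-RPC call to `POST /` could look like, the sketch below builds a request envelope for the `llm.v1.compressPrompt` method. It assumes the service follows JSON-RPC 2.0 conventions. The parameter names `prompt` and `rate` are hypothetical placeholders, not the documented schema, so check the OpenAPI specification for the real field names.

```python
import json
from typing import Any, Dict


def build_jsonrpc_request(params: Dict[str, Any], request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 request for llm.v1.compressPrompt.

    Assumption: the service accepts a standard JSON-RPC 2.0 envelope.
    The contents of `params` are defined by the service, not by this sketch.
    """
    envelope = {
        "jsonrpc": "2.0",
        "method": "llm.v1.compressPrompt",  # method name from this document
        "params": params,
        "id": request_id,
    }
    return json.dumps(envelope)


# Hypothetical parameters: "prompt" and "rate" are illustrative names only.
body = build_jsonrpc_request({"prompt": "Summarize the meeting notes.", "rate": 0.8})
print(body)
```

The resulting string would be sent as the body of a `POST /` request with a `Content-Type: application/json` header.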

Prompt compression options

The AI Prompt Compressor plugin offers flexible compression controls to fit different use cases. You can choose between full-prompt compression, conditional strategies, or selectively compressing only parts of the prompt:

Configuration Option	Description
Compression by ratio	Compress the prompt to a percentage of its original length (for example, reduce to 80%). This allows for consistent shrinkage regardless of the initial size.
Compression by token count	Compress the prompt to a specific token target (for example, 150 tokens). Useful when working close to LLM context window limits.
Conditional rules	Apply different compression strategies based on prompt length. For example, compress prompts under 100 tokens using a 0.8 ratio, and compress longer prompts to a fixed token count.
Selective compression with tags	Wrap sections of the prompt in `<LLMLINGUA>...</LLMLINGUA>` to target only specific parts for compression, preserving untagged content as-is.
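
The conditional-rules option can be pictured as a lookup from prompt length to strategy. The following sketch mirrors the example in the table: prompts under 100 tokens use a 0.8 ratio, and longer prompts are compressed to 150 tokens. The `Rule` structure and function names are hypothetical and do not reflect the plugin's actual configuration schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: applies when min_tokens <= length < max_tokens."""
    min_tokens: int
    max_tokens: Optional[int]  # None means no upper bound
    kind: str                  # "ratio" or "token_count"
    value: float


def pick_rule(token_count: int, rules: List[Rule]) -> Optional[Rule]:
    """Return the first rule whose range contains the prompt length."""
    for rule in rules:
        below_max = rule.max_tokens is None or token_count < rule.max_tokens
        if token_count >= rule.min_tokens and below_max:
            return rule
    return None


def target_tokens(token_count: int, rule: Rule) -> int:
    """Compute the token budget the compressed prompt should aim for."""
    if rule.kind == "ratio":
        # A ratio of 0.8 means "reduce to 80% of the original length".
        return max(1, round(token_count * rule.value))
    # A fixed target never asks for more tokens than the prompt already has.
    return min(token_count, int(rule.value))


# Rules matching the example in the table above.
RULES = [
    Rule(min_tokens=0, max_tokens=100, kind="ratio", value=0.8),
    Rule(min_tokens=100, max_tokens=None, kind="token_count", value=150),
]

for length in (50, 400):
    rule = pick_rule(length, RULES)
    print(length, rule.kind, target_tokens(length, rule))
```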

How it works

1. The user sends the final prompt to the AI Prompt Compressor plugin.
2. The plugin checks the prompt for `<LLMLINGUA>...</LLMLINGUA>` tags.
   - If tags are found, only the tagged sections are sent to LLMLingua 2 for compression.
   - If no tags are found, the entire prompt is sent to LLMLingua 2 for compression.
3. Compression is applied based on the configured rules: by ratio, by target token count, or by conditional length-based rules.
4. The compressed prompt is returned to the plugin.
5. The plugin sends the compressed prompt to the Large Language Model (LLM).
6. The LLM processes the prompt and returns the response to the user.
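
The tag check in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's implementation: `toy_compress` is a hypothetical stand-in for LLMLingua 2 that simply keeps every other word, and the sketch assumes the tags themselves are dropped from the output.

```python
import re
from typing import Callable

# Matches <LLMLINGUA>...</LLMLINGUA> sections, including multi-line ones.
TAG_PATTERN = re.compile(r"<LLMLINGUA>(.*?)</LLMLINGUA>", re.DOTALL)


def compress_prompt(prompt: str, compress: Callable[[str], str]) -> str:
    """Compress only tagged sections if any exist, else the whole prompt."""
    if TAG_PATTERN.search(prompt):
        # Replace each tagged section with its compressed text.
        # Assumption: the tags are removed and untagged text is left as-is.
        return TAG_PATTERN.sub(lambda m: compress(m.group(1)), prompt)
    return compress(prompt)


def toy_compress(text: str) -> str:
    """Hypothetical stand-in for LLMLingua 2: keep every other word."""
    return " ".join(text.split()[::2])


tagged = "Answer briefly. <LLMLINGUA>one two three four</LLMLINGUA>"
print(compress_prompt(tagged, toy_compress))
print(compress_prompt("one two three four", toy_compress))
```

Passing the compressor in as a callable keeps the tag handling separate from the compression strategy, so the same flow applies whether the configured rule is a ratio or a token target.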

The diagram below illustrates how the AI Prompt Compressor plugin processes and compresses incoming prompts based on tagging and configured rules.

 
```
sequenceDiagram
    actor User
    participant KongAICompressor as AI Prompt Compressor Plugin
    participant LLMLingua2 as LLMLingua 2 Compressor
    participant LLM as Large Language Model

    User->>KongAICompressor: Sends final prompt
    activate KongAICompressor
    KongAICompressor->>KongAICompressor: Check for LLMLINGUA tags

    alt If tagged content found
        KongAICompressor->>LLMLingua2: Compress tagged sections
        activate LLMLingua2
        LLMLingua2-->>KongAICompressor: Return compressed sections
        deactivate LLMLingua2
    else If no LLMLINGUA tags
        KongAICompressor->>LLMLingua2: Compress entire prompt
        activate LLMLingua2
        LLMLingua2-->>KongAICompressor: Return compressed prompt
        deactivate LLMLingua2
    end

    KongAICompressor->>LLM: Send compressed prompt
    deactivate KongAICompressor
    activate LLM
    LLM-->>User: Return response
    deactivate LLM
```

The AI Prompt Compressor plugin applies structured compression that preserves the essential context of user prompts, rather than trimming prompts arbitrarily or risking token overflows. As a result, the LLM receives a well-formed, focused prompt while token usage stays under control.

URL: https://developer.konghq.com/plugins/ai-prompt-compressor/