PromptThrift MCP：面向 LLM 应用的智能令牌压缩工具

通过智能对话压缩，削减 70-90% 的 LLM API 成本。现已支持 Gemma 4 本地压缩：更智能的摘要，零 API 成本。

👁 License: MIT
👁 Python 3.10+
👁 MCP Compatible
👁 Gemma 4

⭐ 如果这为你节省了开支，请给本项目点个星！ ⭐

问题所在

每次 LLM API 调用都会重新发送你的整个对话历史。20 轮对话的单次调用成本是 3 轮对话的 6 倍，这意味着你一直在为重复的旧消息付费。

Turn 1: ████ 700 tokens ($0.002)
Turn 5: ████████████████ 4,300 tokens ($0.013)
Turn 20: ████████████████████████████████████████ 12,500 tokens ($0.038)
 ↑ You're paying for THIS every call

Related MCP server: SlimContext MCP Server

解决方案

PromptThrift 是一个 MCP 服务器，提供 4 种工具来大幅降低你的 API 成本：

工具	功能	影响
`promptthrift_compress_history`	将旧对话轮次压缩为智能摘要	输入令牌减少 50-90%
`promptthrift_count_tokens`	追踪 14 种模型的令牌使用量及成本	明确资金去向
`promptthrift_suggest_model`	为任务推荐最便宜的模型	简单任务节省 60-80%
`promptthrift_pin_facts`	固定关键事实，防止被压缩	永不丢失核心上下文

为什么选择 PromptThrift？

PromptThrift	Context Mode	Headroom
许可证	MIT (支持商业用途)	ELv2 (不可竞争)	Apache 2.0
压缩类型	对话记忆	工具模式虚拟化	工具输出
本地 LLM 支持	通过 Ollama 支持 Gemma 4	无	无
成本追踪	多模型对比	无	无
模型路由	内置	无	无
固定事实	永不压缩列表	无	无

快速入门

安装

选项 A：pip 安装（推荐）

pip install git+https://github.com/woling-dev/promptthrift-mcp.git

选项 B：克隆并安装

git clone https://github.com/woling-dev/promptthrift-mcp.git
cd promptthrift-mcp
pip install -e .

可选：启用 Gemma 4 压缩

获取更智能的 AI 驱动压缩（免费，本地运行）：

# Install Ollama: https://ollama.com
ollama pull gemma4:e4b

PromptThrift 会自动检测 Ollama。如果正在运行 → 使用 Gemma 4 进行压缩。如果未运行 → 回退到快速启发式压缩。无需任何配置。

Claude Desktop

添加到 claude_desktop_config.json：

{
 "mcpServers": {
 "promptthrift": {
 "command": "python",
 "args": ["/path/to/promptthrift-mcp/server.py"]
 }
 }
}

Cursor / Windsurf

添加到你的 MCP 设置中：

{
 "mcpServers": {
 "promptthrift": {
 "command": "python",
 "args": ["/path/to/promptthrift-mcp/server.py"]
 }
 }
}

实际案例

一个 AI 编程助手在 30 多轮对话中调试复杂问题：

压缩前（每次 API 调用都发送）：

User: My Next.js app throws a hydration error on the /dashboard page.
Asst: That usually means server and client HTML don't match. Can you share the component?
User: [pastes 50 lines of DashboardLayout.tsx]
Asst: I see the issue, you're using `new Date()` directly in render, which differs
 between server and client. Let me also check your data fetching...
User: I also get a warning about useEffect running twice.
Asst: That's React 18 Strict Mode. Not related to hydration. Let me trace the real bug...
User: Wait, there's also a flash of unstyled content on first load.
Asst: That's a separate CSS loading order issue. Let me address both...
 [... 25 more turns of debugging, trying fixes, checking logs ...]
User: OK it's fixed now! But I want to add dark mode next.
Asst: Great! For dark mode with Next.js + Tailwind, here are three approaches...

30 轮对话后约 8,500 个令牌，且每次 API 调用都在增长

使用 Gemma 4 压缩后：

[Compressed history]
Resolved Next.js hydration error in DashboardLayout.tsx caused by
Date() in render (fixed with useEffect). Unrelated: React 18 Strict Mode
double-fire (expected), CSS flash (fixed via loading order).
User now wants to add dark mode to Next.js + Tailwind app.
[End compressed history]

[Recent turns preserved, last 4 turns intact]

约 1,200 个令牌。后续每次调用节省 86%

规模化成本影响（Claude Sonnet @ $3/MTok）：

场景	不使用 PromptThrift	使用 PromptThrift	每月节省
1 名开发者，每天 20 次会话	$5.10/月	$0.72/月	$4.38
10 名开发者团队	$51/月	$7.20/月	$43.80
客服机器人（每天 500 次对话）	$255/月	$36/月	$219
AI 代理平台（每天 5K 次会话）	$2,550/月	$357/月	$2,193

固定事实（永不压缩列表）

有些事实在压缩过程中绝不能丢失：用户名、关键偏好、重要决策。将它们固定：

You: "Pin the fact that this customer is allergic to nuts"

→ promptthrift_pin_facts(action="add", facts=["Customer is allergic to nuts"])
→ This fact will appear in ALL future compressed summaries, guaranteed.

支持的模型（2026 年 4 月定价）

模型	输入 $/MTok	输出 $/MTok	本地？
gemma-4-e2b	$0.00	$0.00	Ollama
gemma-4-e4b	$0.00	$0.00	Ollama
gemma-4-27b	$0.00	$0.00	Ollama
gemini-2.0-flash	$0.10	$0.40
gpt-4.1-nano	$0.10	$0.40
gpt-4o-mini	$0.15	$0.60
gemini-2.5-flash	$0.15	$0.60
gpt-4.1-mini	$0.40	$1.60
claude-haiku-4.5	$1.00	$5.00
gemini-2.5-pro	$1.25	$10.00
gpt-4.1	$2.00	$8.00
gpt-4o	$2.50	$10.00
claude-sonnet-4.6	$3.00	$15.00
claude-opus-4.6	$5.00	$25.00

工作原理

Before (every API call sends ALL of this):
┌──────────────────────────────────┐
│ System prompt (500 tokens) │
│ Turn 1: user+asst (600 tokens) │ ← Repeated every call
│ Turn 2: user+asst (600 tokens) │ ← Repeated every call
│ ... │
│ Turn 8: user+asst (600 tokens) │ ← Repeated every call
│ Turn 9: user+asst (new) │
│ Turn 10: user (new) │
└──────────────────────────────────┘
Total: ~6,500 tokens per call

After PromptThrift compression:
┌──────────────────────────────────┐
│ System prompt (500 tokens) │
│ [Pinned facts] (50 tokens) │ ← Always preserved
│ [Compressed summary](200 tokens) │ ← Turns 1-8 in 200 tokens!
│ Turn 9: user+asst (kept) │
│ Turn 10: user (kept) │
└──────────────────────────────────┘
Total: ~1,750 tokens per call (73% saved!)

压缩模式

模式	方法	质量	速度	成本
启发式	基于规则的提取	良好 (50-60% 缩减)	即时	免费
LLM (Gemma 4)	AI 驱动的理解	优秀 (70-90% 缩减)	~10-15秒	免费 (本地)

PromptThrift 会自动使用最佳可用方法。安装 Ollama + Gemma 4 以获得最佳压缩质量。

压缩何时最有效？

压缩效果随对话长度和冗余度增加：

对话长度	典型缩减率	适用场景
短 (< 5 轮，多为技术性)	15-25%	节省极少：保持原样
中 (10-20 轮，混合聊天)	50-70%	最佳点：明显的成本削减
长 (30+ 轮，调试/迭代)	70-90%	大幅节省：尽早且频繁压缩

为什么？ 短而密集的对话几乎没有可删除的冗余内容。较长的对话会积累问候语、重复的上下文、探索性的死胡同和冗长的解释，而这些正是压缩器会剔除的内容。一个包含代码片段、来回排查和最终解决方案的 30 轮调试会话，在压缩后效果显著，因为对于未来的上下文而言，只有结论和关键决策才是重要的。

经验法则： 在 8-10 轮对话后开始压缩以获得最佳效果。

环境变量

变量	必需	默认值	描述
`PROMPTTHRIFT_OLLAMA_MODEL`	否	`gemma4:e4b`	用于 LLM 压缩的 Ollama 模型
`PROMPTTHRIFT_OLLAMA_URL`	否	`http://localhost:11434`	Ollama API 端点
`PROMPTTHRIFT_DEFAULT_MODEL`	否	`claude-sonnet-4.6`	用于成本估算的默认模型

安全性

默认情况下，所有数据均在本地处理。没有任何数据离开你的机器
Ollama 压缩 100% 在你的硬件上运行
压缩后清理器会从摘要中剔除提示词注入模式
API 密钥仅从环境变量读取，绝不硬编码
无持久化存储，无遥测，无第三方调用

路线图

[x] 启发式对话压缩
[x] 多模型令牌计数 (14 种模型)
[x] 智能模型路由
[x] 通过 Ollama 进行 Gemma 4 本地 LLM 压缩
[x] 固定事实（永不压缩列表）
[x] 压缩后安全清理器
[ ] 云端压缩 (Anthropic/OpenAI API 回退)
[ ] 提示词缓存优化建议
[ ] 使用分析 Web 仪表板
[ ] VS Code 扩展

贡献

欢迎提交 PR！本项目使用 MIT 许可证。Fork 它，改进它，发布它。

关于 BrandDefender.ai

BrandDefender.ai 是 Wolin Global Media (沃嶺國際媒體) 的产品线，这是一家位于台湾的 AI 基础设施工作室，致力于帮助品牌被 AI 系统发现、理解和推荐。

我们构建什么

🔍 AEO 咨询 (答案引擎优化) 让你的品牌被 ChatGPT、Gemini、Perplexity 和 Claude 正确引用。我们实施 JSON-LD 架构，优化内容结构，并为台湾食品、茶饮、美妆和生活方式品牌监控 AI 搜索表现。

网站: https://aibranddefender.com/
免费 AI 品牌扫描: https://app.aibranddefender.com/

💬 AI 客户服务 (LINE Bot) 生产级 LINE 聊天机器人，具备 3 层记忆、管理员接管和 Supabase 后端。已服务于零售和餐饮行业的真实品牌。

指南: LINE AI 聊天机器人指南

🧠 AI 记忆 MCP 基础设施 面向 Claude Code、Cursor 和 LLM 构建者的开源 MCP 服务器。本地优先，保护隐私，旨在节省 API 成本。

本仓库就是其中之一。
姊妹工具: promptforge · promptthrift-mcp

联系方式

📧 电子邮件: service@wolinglobal.com
💬 LINE: @886upktf
🌐 网站: https://aibranddefender.com/
🐙 GitHub: https://github.com/woling-dev

台湾品牌想做 AEO audit：我们提供 ChatGPT / Gemini / Perplexity 全面扫描 + JSON-LD 修补 + 月度监测。Email 或 LINE 直接找我们聊。

许可证

MIT 许可证。个人和商业用途均免费。

如果这为你节省了开支，请给本项目点个星！

This server cannot be installed

license - permissive license

quality - not tested

maintenance

How are these scores calculated?

Maintenance

–Maintainers

–Response time

–Release cycle

–Releases (12mo)

Commit activity

Resources

GitHub Repository

Need Help?

Related Servers

Appeared in Searches

A guide for reducing token count in AI requests

Latest Blog Posts

Lightport: Open-Sourcing Glama's AI Gateway
By punkpeye on April 27, 2026.
open source
OpenAI
Tool Definition Quality Score (TDQS)
By punkpeye on April 3, 2026.
mcp
The Hackers Who Tracked My Sleep Cycle
By punkpeye on March 26, 2026.
security

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/woling-dev/promptthrift-mcp'

If you have feedback or need assistance with the MCP directory API, please join our Discord server

URL: https://glama.ai/mcp/servers/woling-dev/promptthrift-mcp?locale=zh-CN

⇱ PromptThrift MCP by woling-dev | Glama