
URL: https://huggingface.co/ngocbh/DBTrimKV-Qwen3-VL-8B-Thinking

ngocbh/DBTrimKV-Qwen3-VL-8B-Thinking


DBTrimKV is the dynamic-budget variant of TrimKV. A single global KV-cache budget is shared across all layers and heads and reallocated on the fly, and the final projection of the retention gate is tied across layers.
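
To make the difference between a per-head budget and a shared global budget concrete, here is a toy sketch. It is not the TrimKV implementation: the function names, the score values, and the idea of ranking tokens by a single static score are all assumptions for illustration. The real method uses learned retention gates and reallocates the budget during generation.

```python
import heapq

def fixed_budget_keep(scores, per_head_budget):
    """Keep the top `per_head_budget` tokens independently in every (layer, head)."""
    return {
        key: sorted(heapq.nlargest(per_head_budget, range(len(s)), key=s.__getitem__))
        for key, s in scores.items()
    }

def global_budget_keep(scores, total_budget):
    """Keep the top `total_budget` tokens across all (layer, head) pairs at once."""
    flat = [(s, key, t) for key, vals in scores.items() for t, s in enumerate(vals)]
    kept = {key: [] for key in scores}
    for _, key, t in heapq.nlargest(total_budget, flat, key=lambda item: item[0]):
        kept[key].append(t)
    return {key: sorted(ts) for key, ts in kept.items()}

# Hypothetical retention scores: (layer, head) -> one score per cached token.
scores = {
    (0, 0): [0.9, 0.8, 0.7, 0.6],   # many tokens worth keeping
    (0, 1): [0.1, 0.2, 0.05, 0.5],  # few tokens worth keeping
}

fixed = fixed_budget_keep(scores, per_head_budget=2)   # 2 slots per head
shared = global_budget_keep(scores, total_budget=4)    # same 4 slots, shared
print(fixed)
print(shared)
```

Both calls keep four tokens in total, but the shared budget lets the head with consistently high scores claim slots that the other head would otherwise spend on low-scoring tokens.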

This repository hosts the DBTrimKV retention-gate weights for Qwen/Qwen3-VL-8B-Thinking, trained with memory size M = 32. The base-model weights are not included. They are loaded from Qwen/Qwen3-VL-8B-Thinking at runtime, and the retention-gate weights from trimkv_weights.pth are then overlaid on top.

This model was introduced in the paper Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction.


For the full list of released checkpoints, training recipes, and benchmark scripts, see the GitHub repository: https://github.com/ngocbh/trimkv.

Quick start

import torch
from trimkv.models.qwen3_vl import TrimKVQwen3VLForConditionalGeneration
from trimkv.cache_utils import PagedTrimKVCache
from transformers import AutoTokenizer

model = TrimKVQwen3VLForConditionalGeneration.from_pretrained(
    "ngocbh/DBTrimKV-Qwen3-VL-8B-Thinking",
    torch_dtype=torch.bfloat16,
    load_trimkv_weights=True,
    download_from="huggingface",
    use_cache=True,
    device_map="cuda",
)
model.config._attn_implementation = "flash_attention_2"

tokenizer = AutoTokenizer.from_pretrained(
    model.config.base_model, use_fast=True, padding_side="left"
)

past_key_values = PagedTrimKVCache(
    num_layers=model.config.text_config.num_hidden_layers,
    num_heads=model.config.text_config.num_key_value_heads,
    max_seq_len=32768,
    memory_size=32,
    num_blocks_ratio=1.0,
    buffer_size=32,
    strategy="fixed_budget",
    device="cuda",
)

# Use it like any other Hugging Face model: pass `past_key_values=past_key_values` to `model.generate`.

See examples/test_qwen3.py in the GitHub repo for a full runnable example.

Training details

  • Base model: Qwen/Qwen3-VL-8B-Thinking
  • Variant: DBTrimKV
  • Training datasets: Fancy-MLLM/R1-Onevision, lmms-lab/M4-Instruct-Data, lmms-lab/LLaVA-Video-178K, laolao77/MMDU, open-r1/OpenR1-Math-220k
  • Training memory size M: 32
  • Loss: fwkl_ntp

Citation

@article{bui2025cache,
 title={Cache What Lasts: Token Retention for Memory-Bounded {KV} Cache in {LLM}s},
 author={Bui, Ngoc and Sharma, Shubham and Lamba, Simran and Mishra, Saumitra and Ying, Rex},
 journal={arXiv preprint arXiv:2512.03324},
 year={2025}
}
@article{bui2025make,
 title={Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction},
 author={Bui, Ngoc and Nguyen, Hieu Trung and Cohan, Arman and Ying, Rex},
 journal={arXiv preprint arXiv:2512.03324},
 year={2025}
}