URL: https://huggingface.co/RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8


Qwen2.5-VL-7B-Instruct-quantized-w8a8

Model Overview

  • Model Architecture: Qwen/Qwen2.5-VL-7B-Instruct
    • Input: Text/Image/Video
    • Output: Text
  • Model Optimizations:
    • Weight quantization: INT8
    • Activation quantization: INT8
  • Release Date: 2/24/2025
  • Version: 1.0
  • Model Developers: Neural Magic

Quantized version of Qwen/Qwen2.5-VL-7B-Instruct.

Model Optimizations

This model was obtained by quantizing the weights and activations of Qwen/Qwen2.5-VL-7B-Instruct to the INT8 data type, and is ready for inference with vLLM >= 0.5.2.
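To make the idea of INT8 quantization concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is an illustration only: the card does not state the granularity or calibration method used for this model, so the function names, the per-tensor scale, and the sample values are all assumptions, and this is not the llm-compressor algorithm.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization (illustrative assumption).

    Returns the integer codes and the scale needed to map them back.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the signed 8-bit range.
    codes = [max(-128, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [c * scale for c in codes]


# Hypothetical weight values, chosen only for the example.
weights = [0.50, -1.27, 0.03, 0.89]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

print(codes)
print([round(r, 2) for r in restored])
```

Each restored value differs from its original by at most half a quantization step (`scale / 2`), which is the rounding error that the accuracy recovery numbers further down are measuring at the model level.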

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm.assets.image import ImageAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
)

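# prepare inputs using the Qwen2.5-VL chat format and image placeholder tokens
question = "What is the content of this image?"
inputs = {
    "prompt": (
        "<|im_start|>user\n"
        "<|vision_start|><|image_pad|><|vision_end|>"
        f"{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    ),
    "multi_modal_data": {
        "image": ImageAsset("cherry_blossom").pil_image.convert("RGB")
    },
}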

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
print(f"PROMPT : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
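As a rough sketch of what a request to such a server could look like, the snippet below builds a chat completion payload with one image and one text part. It assumes the server was started with `vllm serve neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8` and is listening at the default `http://localhost:8000/v1/chat/completions`. The helper name `build_chat_request` and the image URL are hypothetical, and the snippet only constructs the JSON body rather than sending it.

```python
import json


def build_chat_request(model, question, image_url, max_tokens=64, temperature=0.2):
    """Build an OpenAI-style chat completion payload with one image (hypothetical helper)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


payload = build_chat_request(
    model="neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8",
    question="What is the content of this image?",
    image_url="https://example.com/cherry_blossom.jpg",  # hypothetical URL
)
body = json.dumps(payload)  # POST this body to the chat completions endpoint
print(body)
```

With the chat endpoint, the server applies the model's chat template, so the prompt does not need hand-written special tokens as in the offline example.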

Creation

This model was created with llm-compressor by running the code snippet below as part of a multimodal announcement blog.

Evaluation

The model was evaluated using mistral-evals for vision-related tasks and using lm_evaluation_harness for select text-based benchmarks. The evaluations were conducted using the following commands:

Accuracy

| Category | Benchmark | Metric | Qwen/Qwen2.5-VL-7B-Instruct | Qwen2.5-VL-7B-Instruct-quantized.w8a8 | Recovery (%) |
|---|---|---|---|---|---|
| Vision | MMMU (val, CoT) | explicit_prompt_relaxed_correctness | 52.00 | 52.33 | 100.63% |
| Vision | VQAv2 (val) | vqa_match | 75.59 | 75.46 | 99.83% |
| Vision | DocVQA (val) | anls | 94.27 | 94.09 | 99.81% |
| Vision | ChartQA (test, CoT) | anywhere_in_answer_relaxed_correctness | 86.44 | 86.16 | 99.68% |
| Vision | Mathvista (testmini, CoT) | explicit_prompt_relaxed_correctness | 69.47 | 70.47 | 101.44% |
| Vision | Average Score | | 75.95 | 75.90 | 99.93% |
| Text | MGSM (CoT) | | 56.38 | 55.13 | 97.78% |
| Text | MMLU (5-shot) | | 71.09 | 70.57 | 99.27% |
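The Recovery column is consistent with the quantized score divided by the baseline score, expressed as a percentage. The short sketch below reproduces three of the reported values; the function name `recovery` is chosen here for illustration and is not part of any evaluation tool.

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score, rounded to 2 decimals."""
    return round(quantized / baseline * 100, 2)


# (baseline, quantized) pairs copied from the accuracy table.
scores = {
    "MMMU (val, CoT)": (52.00, 52.33),
    "MGSM (CoT)": (56.38, 55.13),
    "MMLU (5-shot)": (71.09, 70.57),
}

for name, (base, quant) in scores.items():
    print(f"{name}: {recovery(base, quant)}%")
# MMMU (val, CoT): 100.63%
# MGSM (CoT): 97.78%
# MMLU (5-shot): 99.27%
```

Values above 100% mean the quantized model scored slightly higher than the baseline on that benchmark.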

Inference Performance

This model achieves up to 1.56x speedup in single-stream deployment and 1.5x in multi-stream deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2 and GuideLLM.

Single-stream performance (measured with vLLM version 0.7.2)

Use case profiles:

  • Document Visual Question Answering: 1680W x 2240H, 64/128
  • Visual Reasoning: 640W x 480H, 128/128
  • Image Captioning: 480W x 360H, 0/128

| Hardware | Model | Average Cost Reduction | Document VQA: Latency (s) | Document VQA: Queries Per Dollar | Visual Reasoning: Latency (s) | Visual Reasoning: Queries Per Dollar | Image Captioning: Latency (s) | Image Captioning: Queries Per Dollar |
|---|---|---|---|---|---|---|---|---|
| A6000x1 | Qwen/Qwen2.5-VL-7B-Instruct | | 4.9 | 912 | 3.2 | 1386 | 3.1 | 1431 |
| A6000x1 | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.50 | 3.6 | 1248 | 2.1 | 2163 | 2.0 | 2237 |
| A6000x1 | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 2.05 | 3.3 | 1351 | 1.4 | 3252 | 1.4 | 3321 |
| A100x1 | Qwen/Qwen2.5-VL-7B-Instruct | | 2.8 | 707 | 1.7 | 1162 | 1.7 | 1198 |
| A100x1 | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.24 | 2.4 | 851 | 1.4 | 1454 | 1.3 | 1512 |
| A100x1 | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.49 | 2.2 | 912 | 1.1 | 1791 | 1.0 | 1950 |
| H100x1 | Qwen/Qwen2.5-VL-7B-Instruct | | 2.0 | 557 | 1.2 | 919 | 1.2 | 941 |
| H100x1 | neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic | 1.28 | 1.6 | 698 | 0.9 | 1181 | 0.9 | 1219 |
| H100x1 | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.28 | 1.6 | 686 | 0.9 | 1191 | 0.9 | 1228 |

Use case profiles: Image Size (WxH) / prompt tokens / generation tokens

QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
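The card does not define how Average Cost Reduction is computed, but the reported values are consistent with averaging the queries-per-dollar ratio (quantized over baseline) across the three use cases. The sketch below checks that reading against the A6000 rows; the function name and the interpretation itself are assumptions.

```python
from statistics import mean


def average_cost_reduction(baseline_qpd, candidate_qpd):
    """Per-use-case QPD ratios and their mean (assumed definition, not stated in the card)."""
    ratios = [cand / base for base, cand in zip(baseline_qpd, candidate_qpd)]
    return ratios, round(mean(ratios), 2)


# Single-stream A6000x1 queries-per-dollar values, in table order:
# Document VQA, Visual Reasoning, Image Captioning.
baseline = [912, 1386, 1431]
w8a8 = [1248, 2163, 2237]
w4a16 = [1351, 3252, 3321]

ratios, avg = average_cost_reduction(baseline, w8a8)
print([round(r, 2) for r in ratios], avg)  # [1.37, 1.56, 1.56] 1.5
print(average_cost_reduction(baseline, w4a16)[1])  # 2.05
```

Under this reading, the largest per-use-case ratio for the w8a8 model on A6000 (about 1.56) lines up with the "up to 1.56x" single-stream figure quoted at the top of this section.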

Multi-stream asynchronous performance (measured with vLLM version 0.7.2)

Use case profiles:

  • Document Visual Question Answering: 1680W x 2240H, 64/128
  • Visual Reasoning: 640W x 480H, 128/128
  • Image Captioning: 480W x 360H, 0/128

| Hardware | Model | Average Cost Reduction | Document VQA: Maximum throughput (QPS) | Document VQA: Queries Per Dollar | Visual Reasoning: Maximum throughput (QPS) | Visual Reasoning: Queries Per Dollar | Image Captioning: Maximum throughput (QPS) | Image Captioning: Queries Per Dollar |
|---|---|---|---|---|---|---|---|---|
| A6000x1 | Qwen/Qwen2.5-VL-7B-Instruct | | 0.4 | 1837 | 1.5 | 6846 | 1.7 | 7638 |
| A6000x1 | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.41 | 0.5 | 2297 | 2.3 | 10137 | 2.5 | 11472 |
| A6000x1 | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.60 | 0.4 | 1828 | 2.7 | 12254 | 3.4 | 15477 |
| A100x1 | Qwen/Qwen2.5-VL-7B-Instruct | | 0.7 | 1347 | 2.6 | 5221 | 3.0 | 6122 |
| A100x1 | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8 | 1.27 | 0.8 | 1639 | 3.4 | 6851 | 3.9 | 7918 |
| A100x1 | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.21 | 0.7 | 1314 | 3.0 | 5983 | 4.6 | 9206 |
| H100x1 | Qwen/Qwen2.5-VL-7B-Instruct | | 0.9 | 969 | 3.1 | 3358 | 3.3 | 3615 |
| H100x1 | neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic | 1.29 | 1.2 | 1331 | 3.8 | 4109 | 4.2 | 4598 |
| H100x1 | neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16 | 1.28 | 1.2 | 1298 | 3.8 | 4190 | 4.2 | 4573 |

Use case profiles: Image Size (WxH) / prompt tokens / generation tokens

QPS: Queries per second.

QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
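For readers who want to relate throughput to cost, a queries-per-dollar figure can be derived from a sustained queries-per-second rate and an hourly instance price. The sketch below shows that conversion. The card does not list the prices it used, so `HYPOTHETICAL_HOURLY_COST_USD` is a placeholder, and the result is not expected to match the table exactly because the published QPS values are rounded.

```python
def queries_per_dollar(qps, hourly_cost_usd):
    """Convert a sustained throughput (queries/second) into queries per dollar."""
    if hourly_cost_usd <= 0:
        raise ValueError("hourly_cost_usd must be positive")
    queries_per_hour = qps * 3600
    return queries_per_hour / hourly_cost_usd


# Placeholder price, NOT taken from the card; substitute the real on-demand rate.
HYPOTHETICAL_HOURLY_COST_USD = 0.80

qpd = queries_per_dollar(1.5, HYPOTHETICAL_HOURLY_COST_USD)
print(round(qpd))
```

Because the price cancels out when two models run on the same hardware, the ratio of their queries-per-dollar values equals the ratio of their throughputs, which is why cost reduction can be compared without knowing the exact price.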

Safetensors
Model size: 8B params
Tensor type: BF16 · I8
