Qwen2.5-VL-7B-Instruct-quantized.w8a8

Model Overview

Model Architecture: Qwen/Qwen2.5-VL-7B-Instruct
- Input: Text/Image/Video
- Output: Text
Model Optimizations:
- Weight quantization: INT8
- Activation quantization: INT8
Release Date: 2/24/2025
Version: 1.0
Model Developers: Neural Magic

Quantized version of Qwen/Qwen2.5-VL-7B-Instruct.

Model Optimizations

This model was obtained by quantizing the weights and activations of Qwen/Qwen2.5-VL-7B-Instruct to the INT8 data type, ready for inference with vLLM >= 0.5.2.
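To make the idea of INT8 quantization concrete, the following is a minimal sketch of symmetric, per-tensor quantization in plain Python. It is an illustration only: this section does not state the exact scheme used for this model (granularity, static versus dynamic activation scales, calibration data), and the function names `quantize_int8` and `dequantize_int8` are hypothetical, not part of llm-compressor or vLLM.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization of a list of floats.

    Illustrative only; not the recipe used to produce this model.
    """
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor, mapping max_abs to 127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clip to the symmetric INT8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Map INT8 codes back to approximate float values."""
    return [x * scale for x in q]


weights = [0.50, -1.27, 0.03, 1.00]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
print(q, scale)
print(restored)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the rounding error that the accuracy recovery numbers later in this card measure in aggregate.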

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm.assets.image import ImageAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
)

# prepare inputs
question = "What is the content of this image?"
inputs = {
    # Qwen2.5-VL chat format with a single image placeholder
    "prompt": (
        "<|im_start|>user\n"
        "<|vision_start|><|image_pad|><|vision_end|>"
        f"{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    ),
    "multi_modal_data": {
        "image": ImageAsset("cherry_blossom").pil_image.convert("RGB")
    },
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
print(f"PROMPT  : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
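As a rough sketch of what a request to an OpenAI-compatible endpoint could look like, the snippet below builds a chat-completions request body containing one image and one text question. It only constructs and prints the JSON; the helper name `build_chat_request` and the image URL are hypothetical, and the server address is left to the reader.

```python
import json


def build_chat_request(model, question, image_url, max_tokens=64, temperature=0.2):
    """Build an OpenAI-style chat-completions body with one image and one question."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


payload = build_chat_request(
    model="neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8",
    question="What is the content of this image?",
    image_url="https://example.com/cherry_blossom.jpg",  # hypothetical URL
)
print(json.dumps(payload, indent=2))
```

Assuming a server has been started with `vllm serve neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8`, this body would be sent as a POST to the server's `/v1/chat/completions` route. The sampling values mirror the offline example above.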

Creation

This model was created with llm-compressor by running the code snippet below as part of a multimodal announcement blog.

Evaluation

The model was evaluated using mistral-evals for vision-related tasks and using lm_evaluation_harness for select text-based benchmarks. The evaluations were conducted using the following commands:

Accuracy

Category	Metric	Qwen/Qwen2.5-VL-7B-Instruct	Qwen2.5-VL-7B-Instruct-quantized.w8a8	Recovery (%)
Vision	MMMU (val, CoT) explicit_prompt_relaxed_correctness	52.00	52.33	100.63%
	VQAv2 (val) vqa_match	75.59	75.46	99.83%
	DocVQA (val) anls	94.27	94.09	99.81%
	ChartQA (test, CoT) anywhere_in_answer_relaxed_correctness	86.44	86.16	99.68%
	Mathvista (testmini, CoT) explicit_prompt_relaxed_correctness	69.47	70.47	101.44%
	Average Score	75.95	75.90	99.93%
Text	MGSM (CoT)	56.38	55.13	97.78%
	MMLU (5-shot)	71.09	70.57	99.27%
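The Recovery (%) column is consistent with the quantized score divided by the baseline score, expressed as a percentage. The short sketch below recomputes it from the per-benchmark scores in the table; the helper name `recovery` is illustrative, and the formula is inferred from the numbers rather than stated by the authors.

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score, rounded to 2 decimals."""
    return round(100.0 * quantized / baseline, 2)


# (baseline, quantized) pairs copied from the accuracy table above
scores = {
    "MMMU": (52.00, 52.33),
    "VQAv2": (75.59, 75.46),
    "DocVQA": (94.27, 94.09),
    "ChartQA": (86.44, 86.16),
    "Mathvista": (69.47, 70.47),
    "MGSM": (56.38, 55.13),
    "MMLU": (71.09, 70.57),
}

for name, (base, quant) in scores.items():
    print(f"{name}: {recovery(base, quant):.2f}%")
```

Values above 100% (MMMU, Mathvista) mean the quantized model scored slightly higher than the baseline on that benchmark.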

Inference Performance

This model achieves up to 1.56x speedup in single-stream deployment and up to 1.5x in multi-stream deployment, depending on hardware and use-case scenario. The following performance benchmarks were conducted with vLLM version 0.7.2 and GuideLLM.

Single-stream performance (measured with vLLM version 0.7.2)

			Document Visual Question Answering 1680W x 2240H 64/128		Visual Reasoning 640W x 480H 128/128		Image Captioning 480W x 360H 0/128
Hardware	Model	Average Cost Reduction	Latency (s)	Queries Per Dollar	Latency (s)	Queries Per Dollar	Latency (s)	Queries Per Dollar
A6000x1	Qwen/Qwen2.5-VL-7B-Instruct	-	4.9	912	3.2	1386	3.1	1431
	neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8	1.50	3.6	1248	2.1	2163	2.0	2237
	neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16	2.05	3.3	1351	1.4	3252	1.4	3321
A100x1	Qwen/Qwen2.5-VL-7B-Instruct	-	2.8	707	1.7	1162	1.7	1198
	neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8	1.24	2.4	851	1.4	1454	1.3	1512
	neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16	1.49	2.2	912	1.1	1791	1.0	1950
H100x1	Qwen/Qwen2.5-VL-7B-Instruct	-	2.0	557	1.2	919	1.2	941
	neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic	1.28	1.6	698	0.9	1181	0.9	1219
	neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16	1.28	1.6	686	0.9	1191	0.9	1228

**Use case profiles: Image Size (WxH) / prompt tokens / generation tokens

**QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
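The Average Cost Reduction column is not defined in this section, but its values are consistent with the mean, across the three use cases, of the ratio between a model's queries per dollar and the baseline's. The sketch below checks that reading against two rows of the single-stream table; treat the formula as an inference from the numbers, and `average_cost_reduction` as an illustrative name.

```python
def average_cost_reduction(baseline_qpd, model_qpd):
    """Mean of per-use-case QPD ratios (model / baseline), rounded to 2 decimals.

    Assumption: this is how the table's column was derived; the card does not say so.
    """
    ratios = [m / b for b, m in zip(baseline_qpd, model_qpd)]
    return round(sum(ratios) / len(ratios), 2)


# Queries per dollar for the three use cases, copied from the single-stream table
a6000_baseline = [912, 1386, 1431]
a6000_w8a8 = [1248, 2163, 2237]
a100_baseline = [707, 1162, 1198]
a100_w8a8 = [851, 1454, 1512]

print(average_cost_reduction(a6000_baseline, a6000_w8a8))
print(average_cost_reduction(a100_baseline, a100_w8a8))
```

Under this assumption the two results agree with the 1.50 and 1.24 reported for the w8a8 model on A6000 and A100.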

Multi-stream asynchronous performance (measured with vLLM version 0.7.2)

			Document Visual Question Answering 1680W x 2240H 64/128		Visual Reasoning 640W x 480H 128/128		Image Captioning 480W x 360H 0/128
Hardware	Model	Average Cost Reduction	Maximum throughput (QPS)	Queries Per Dollar	Maximum throughput (QPS)	Queries Per Dollar	Maximum throughput (QPS)	Queries Per Dollar
A6000x1	Qwen/Qwen2.5-VL-7B-Instruct	-	0.4	1837	1.5	6846	1.7	7638
	neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8	1.41	0.5	2297	2.3	10137	2.5	11472
	neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16	1.60	0.4	1828	2.7	12254	3.4	15477
A100x1	Qwen/Qwen2.5-VL-7B-Instruct	-	0.7	1347	2.6	5221	3.0	6122
	neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w8a8	1.27	0.8	1639	3.4	6851	3.9	7918
	neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16	1.21	0.7	1314	3.0	5983	4.6	9206
H100x1	Qwen/Qwen2.5-VL-7B-Instruct	-	0.9	969	3.1	3358	3.3	3615
	neuralmagic/Qwen2.5-VL-7B-Instruct-FP8-Dynamic	1.29	1.2	1331	3.8	4109	4.2	4598
	neuralmagic/Qwen2.5-VL-7B-Instruct-quantized.w4a16	1.28	1.2	1298	3.8	4190	4.2	4573

**Use case profiles: Image Size (WxH) / prompt tokens / generation tokens

**QPS: Queries per second.

**QPD: Queries per dollar, based on on-demand cost at Lambda Labs (observed on 2/18/2025).
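For readers who want to recompute queries per dollar for their own hardware pricing, the relationship between throughput, latency, and cost can be sketched as follows. The hourly price in the example is a hypothetical placeholder, not the Lambda Labs price used for the tables, and the function names are illustrative.

```python
SECONDS_PER_HOUR = 3600.0


def qpd_from_throughput(qps, hourly_cost_usd):
    """Queries per dollar for a server sustaining `qps` queries per second."""
    return qps * SECONDS_PER_HOUR / hourly_cost_usd


def qpd_from_latency(latency_s, hourly_cost_usd):
    """Queries per dollar for single-stream serving, one query at a time."""
    return SECONDS_PER_HOUR / (latency_s * hourly_cost_usd)


hourly_cost_usd = 1.00  # hypothetical on-demand price, not taken from this card
print(qpd_from_throughput(2.3, hourly_cost_usd))
print(qpd_from_latency(2.1, hourly_cost_usd))
```

Because the tables report rounded latency and throughput, recomputing from the printed values will generally not reproduce the published queries-per-dollar figures exactly.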


URL: https://huggingface.co/RedHatAI/Qwen2.5-VL-7B-Instruct-quantized.w8a8
