Qwen3-Next-80B-A3B-Instruct-quantized.w4a16
Model Overview
- Model Architecture: Qwen3NextForCausalLM
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: INT4
- Version: 1.0
- Model Developers: Red Hat (Neural Magic)
- ModelCar Storage URI: oci://registry.redhat.io/rhai/modelcar-qwen3-next-80b-a3b-instruct-quantized-w4a16:3.0
- Validated on vLLM: 0.13.0
- Validated on RHAIIS: 3.3
- Validated on RHOAI: 3.3
Model Optimizations
This model was obtained by quantizing the weights of Qwen/Qwen3-Next-80B-A3B-Instruct to the INT4 data type. This optimization reduces the number of bits per parameter from 16 to 4, which cuts disk size and GPU memory requirements by approximately 75%.
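The 75% figure follows from simple arithmetic, sketched below. The parameter count of 80e9 is an assumption taken from the "80B" in the model name, and the helper `weight_memory_gb` is hypothetical. The estimate ignores the per-group scales and any layers left unquantized, so the real saving should be somewhat smaller.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: nominal total parameter count implied by "80B" in the model name.
NUM_PARAMS = 80e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # original 16-bit weights
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # INT4 weights, scales not counted
reduction = 1 - int4_gb / bf16_gb

print(f"16-bit weights: {bf16_gb:.0f} GB")
print(f"4-bit weights:  {int4_gb:.0f} GB")
print(f"Reduction:      {reduction:.0%}")
```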
Only the weights of the linear operators within the transformer blocks are quantized. Weights are quantized using a symmetric per-group scheme with a group size of 128. Quantization uses the GPTQ algorithm, as implemented in the llm-compressor library.
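To make the symmetric per-group scheme concrete, here is a minimal NumPy sketch. It uses plain round-to-nearest, not GPTQ, and it is not the llm-compressor implementation. The function names, the choice of dividing by 7 to obtain the scale, and the random test matrix are all assumptions made for illustration.

```python
import numpy as np

GROUP_SIZE = 128  # group size stated in the model card
QMAX = 7          # assumption: scale maps the largest magnitude in a group to 7


def quantize_per_group(weights: np.ndarray, group_size: int = GROUP_SIZE):
    """Symmetric round-to-nearest INT4 quantization with one scale per group."""
    out_features, in_features = weights.shape
    if in_features % group_size != 0:
        raise ValueError("in_features must be a multiple of group_size")
    # Split each row into contiguous groups of `group_size` weights.
    groups = weights.reshape(out_features, in_features // group_size, group_size)
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    # Symmetric scheme: no zero point, only a scale. Guard against all-zero groups.
    scales = np.where(max_abs > 0, max_abs / QMAX, 1.0)
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales


def dequantize(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Map integer codes back to floating point using the group scales."""
    return (q.astype(np.float32) * scales).reshape(shape)


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 256)).astype(np.float32)  # toy weight matrix
q, scales = quantize_per_group(w)
w_hat = dequantize(q, scales, w.shape)
max_err = float(np.abs(w - w_hat).max())
print("codes shape:", q.shape, "scales shape:", scales.shape)
print("max abs reconstruction error:", max_err)
```

GPTQ uses the same storage format but chooses the integer codes with calibration data so that the layer output error is reduced, rather than rounding each weight independently as this sketch does.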
Deployment
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
model_id = "RedHatAI/Qwen3-Next-80B-A3B-Instruct-quantized.w4a16"
number_gpus = 1  # tensor-parallel degree; increase to shard the model across GPUs
# Sampling settings used for generation
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0, max_tokens=256)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Render the conversation with the model's chat template, returning a string prompt
messages = [{"role": "user", "content": "Give me a short introduction to large language model."}]
prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
# Load the quantized model and generate a completion
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(prompts, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
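As a rough illustration of OpenAI-compatible serving, the sketch below sends a chat completion request using only the Python standard library. It assumes a server was started separately (for example with `vllm serve RedHatAI/Qwen3-Next-80B-A3B-Instruct-quantized.w4a16`) and is listening on `http://localhost:8000`, which is an assumed address. The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

MODEL_ID = "RedHatAI/Qwen3-Next-80B-A3B-Instruct-quantized.w4a16"
BASE_URL = "http://localhost:8000/v1"  # assumption: local server on port 8000


def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build a chat completions payload using standard OpenAI-style fields."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def chat(prompt: str) -> str:
    """POST the request to the server and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, calling `chat("Give me a short introduction to large language models.")` should return the generated reply as a string.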
Creation
Evaluation
The model was evaluated on the OpenLLM leaderboard tasks (versions 1 and 2) using lm-evaluation-harness, and on reasoning tasks using lighteval. vLLM was used for all evaluations.
Accuracy
| Category | Metric | Qwen/Qwen3-Next-80B-A3B-Instruct | RedHatAI/Qwen3-Next-80B-A3B-Instruct-quantized.w4a16 | Recovery (%) |
|---|---|---|---|---|
| OpenLLM V1 | ARC-Challenge (Acc-Norm, 25-shot) | 73.29 | 72.70 | 99.19 |
| | GSM8K (Strict-Match, 5-shot) | 81.58 | 82.18 | 100.74 |
| | HellaSwag (Acc-Norm, 10-shot) | 63.90 | 63.64 | 99.59 |
| | MMLU (Acc, 5-shot) | 85.56 | 85.03 | 99.38 |
| | TruthfulQA (MC2, 0-shot) | 60.70 | 60.63 | 99.88 |
| | Winogrande (Acc, 5-shot) | 78.30 | 78.37 | 100.09 |
| | Average Score | 73.89 | 73.76 | 99.82 |
| OpenLLM V2 | IFEval (Inst Level Strict Acc, 0-shot) | 77.46 | 80.70 | 104.18 |
| | BBH (Acc-Norm, 3-shot) | 67.78 | 67.33 | 99.34 |
| | Math-Hard (Exact-Match, 4-shot) | 56.04 | 55.36 | 98.79 |
| | GPQA (Acc-Norm, 0-shot) | 28.61 | 28.61 | 100.00 |
| | MUSR (Acc-Norm, 0-shot) | 39.68 | 40.08 | 101.01 |
| | MMLU-Pro (Acc, 5-shot) | 76.35 | 75.48 | 98.86 |
| | Average Score | 57.65 | 57.93 | 100.49 |
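The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. The short sketch below recomputes it for three rows of the table. The `recovery` helper is a hypothetical name, and the formula is inferred from the reported numbers rather than stated in the source.

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return quantized / baseline * 100


# (baseline, quantized) pairs copied from the accuracy table
scores = {
    "ARC-Challenge": (73.29, 72.70),
    "GSM8K": (81.58, 82.18),
    "IFEval": (77.46, 80.70),
}

for name, (baseline, quantized) in scores.items():
    print(f"{name}: {recovery(baseline, quantized):.2f}%")
```

Values above 100% mean the quantized model scored higher than the baseline on that task.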