gpt-oss-20b-BF16-w8a8-llmcompressor-v0.10.0.2
Model Overview
- Model Architecture: GptOssForCausalLM
- Input: Text
- Output: Text
- Source Model: gpt-oss-20b-BF16
- Supported Hardware: AMD EPYC (CPU inference)
- Preferred Operating System: Linux
- Inference Engine: vLLM v0.22.0
- Quantization Framework: LLM Compressor v0.10.0.2
- Quantization Method: 8-bit Weight, 8-bit Dynamic Activation Quantization (W8A8)
- Compatible Stack:
  - ZenDNN v6.0.0
  - PyTorch v2.11
  - ZenTorch v2.11.0.1
  - vLLM v0.22.0
This is a quantized version of gpt-oss-20b-BF16 created by AMD using LLM Compressor (compressed-tensors) for ZenDNN-optimized CPU inference.
Quantization
The model was quantized from gpt-oss-20b-BF16 with LLM Compressor using the Round-to-Nearest (RTN) algorithm. Quantization reduces the on-disk size of the model weights from 40 GiB to 21 GiB (a reduction of about 47%).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modeling.gpt_oss import convert_model_for_quantization_gptoss
model_id = "unsloth/gpt-oss-20b-BF16"
output_dir = "./gpt-oss-20b-BF16-w8a8"
# Step 1: Load the BF16 model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Step 2: Expand fused gpt-oss MoE experts into individual nn.Linear modules
# so QuantizationModifier quantizes every expert.
convert_model_for_quantization_gptoss(model)
# Step 3: Define the W8A8 recipe
recipe = QuantizationModifier(
    scheme="W8A8",
    targets=["Linear"],
    # Keep the output head and the MoE routing layers in BF16
    ignore=[
        "lm_head",
        r"re:.*\.router$",
        r"re:.*\.router\..*",
        r"re:.*\.gate$",
        r"re:.*\.mlp\.gate$",
    ],
)
# Step 4: One-shot quantize and save in compressed-tensors format
oneshot(
    model=model,
    recipe=recipe,
    tokenizer=tokenizer,
    output_dir=output_dir,
    trust_remote_code_model=True,
)
# Smoke test
inputs = tokenizer("What are we having for dinner?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
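To make the Round-to-Nearest W8A8 scheme more concrete, here is a minimal pure-Python sketch of the arithmetic. It is not LLM Compressor's implementation. It assumes symmetric int8 quantization with one scale per weight row (computed once, offline) and one scale per input vector (computed dynamically at inference time); the helper names `quantize_rtn` and `w8a8_linear` are hypothetical.

```python
def quantize_rtn(values, num_bits=8):
    """Symmetric round-to-nearest quantization of one vector.

    Returns the integer codes and the scale needed to dequantize them.
    """
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Python's round() rounds to the nearest integer (ties go to even).
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def w8a8_linear(x, weight):
    """Approximate y = weight @ x using int8 codes for both operands."""
    qx, sx = quantize_rtn(x)  # activation scale: computed per input (dynamic)
    out = []
    for row in weight:
        qw, sw = quantize_rtn(row)  # weight scale: one per output channel
        acc = sum(a * b for a, b in zip(qx, qw))  # integer accumulation
        out.append(acc * sx * sw)  # rescale back to floating point
    return out


x = [0.5, -1.25, 2.0]
weight = [[0.1, -0.2, 0.3], [1.0, 0.5, -0.75]]

exact = [sum(a * b for a, b in zip(x, row)) for row in weight]
approx = w8a8_linear(x, weight)
print("exact :", exact)
print("approx:", approx)
```

Because RTN needs only the tensor values themselves, no calibration data is required, which is consistent with the `oneshot` call above being run without a dataset.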
Quick Start
Use with vLLM
from vllm import LLM, SamplingParams
model = LLM(
    model="amd/gpt-oss-20b-BF16-w8a8-llmcompressor-v0.10.0.2",
    dtype="bfloat16",
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = model.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)
Requirements
torch==2.11
zentorch==2.11.0.1
vllm==0.22.0
llmcompressor==0.10.0.2
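Because the model is tied to specific versions, it can help to confirm the environment before loading it. The following is a small sketch that uses only the standard library; the `check_versions` helper and the prefix-matching rule are assumptions made for illustration, not part of any of the listed packages.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the requirements list above.
PINNED = {
    "torch": "2.11",
    "zentorch": "2.11.0.1",
    "vllm": "0.22.0",
    "llmcompressor": "0.10.0.2",
}


def check_versions(pinned):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Prefix match so builds such as "2.11.0+cpu" satisfy the "2.11" pin.
        # This is a loose heuristic, not a full version-specifier check.
        ok = installed is not None and installed.startswith(wanted)
        report[name] = (installed, ok)
    return report


for name, (installed, ok) in check_versions(PINNED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: installed={installed} pinned={PINNED[name]} [{status}]")
```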
OpenMP Setup
For optimal performance, preload an OpenMP runtime through LD_PRELOAD, using either libomp.so (LLVM OpenMP) or libiomp5.so (Intel OpenMP):
# Using LLVM OpenMP (llvmopenmp)
export LD_PRELOAD=$(find /path/to/env -name "libomp.so" | head -1)
# Or using Intel OpenMP (libiomp)
export LD_PRELOAD=$(find /path/to/env -name "libiomp5.so" | head -1)
Note: Set LD_PRELOAD before launching vLLM or any inference script.
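Since the preload only takes effect if the variable is set before the process starts, an inference script can check for it early and warn otherwise. The sketch below is illustrative; the `preloaded_openmp` helper is hypothetical, and it only inspects the environment variable rather than the libraries actually loaded.

```python
import os


def preloaded_openmp(env=None):
    """Return the LD_PRELOAD entries whose file name looks like an OpenMP runtime."""
    env = os.environ if env is None else env
    # LD_PRELOAD entries may be separated by colons or spaces.
    entries = env.get("LD_PRELOAD", "").replace(":", " ").split()
    return [
        entry
        for entry in entries
        if os.path.basename(entry).startswith(("libomp", "libiomp"))
    ]


found = preloaded_openmp()
if found:
    print("OpenMP runtime preloaded:", ", ".join(found))
else:
    print("Warning: no libomp.so or libiomp5.so found in LD_PRELOAD")
```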
Evaluation
The model was evaluated against the BF16 (unquantized) baseline on standard benchmarks using lm-evaluation-harness with the vLLM engine.
| Benchmark | BF16 Baseline | W8A8 (this model) | Recovery |
|---|---|---|---|
| GSM8K (5-shot) | 0.8825 | 0.8832 | 100.08% |
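The Recovery column is the quantized score expressed as a percentage of the baseline score. As a quick check of the arithmetic, using the two GSM8K values from the table:

```python
baseline = 0.8825   # BF16 baseline, GSM8K (5-shot)
quantized = 0.8832  # W8A8 (this model), GSM8K (5-shot)

recovery = quantized / baseline * 100
print(f"Recovery: {recovery:.2f}%")  # prints "Recovery: 100.08%"
```

A recovery slightly above 100% means the quantized model scored marginally higher than the baseline on this run; a difference this small is plausibly within run-to-run variation.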
Evaluation Command
lm_eval \
  --model vllm \
  --model_args pretrained=amd/gpt-oss-20b-BF16-w8a8-llmcompressor-v0.10.0.2,dtype=bfloat16,max_model_len=4096 \
  --tasks gsm8k \
  --batch_size auto \
  --trust_remote_code \
  --num_fewshot 5 \
  --apply_chat_template \
  --log_samples \
  --gen_kwargs "max_gen_toks=2048" \
  --output_path .
Limitations
- Version Lock: This model is compatible with ZenTorch v2.11.0.1 / PyTorch v2.11. It may not load correctly on other versions.
- CPU Only: This model is optimized for AMD EPYC CPU inference via ZenDNN. It is not intended for GPU inference.
License
This model is distributed under the same license as the source model. See the LICENSE file for details.
Modifications copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved.
