gpt-oss-20b-BF16-w8a8-llmcompressor-v0.10.0.2

Model Overview

Model Architecture: GptOssForCausalLM
- Input: Text
- Output: Text
Source Model: unsloth/gpt-oss-20b-BF16 (derived from openai/gpt-oss-20b)
Supported Hardware: AMD EPYC (CPU inference)
Preferred Operating System: Linux
Inference Engine: vLLM v0.22.0
Quantization Framework: LLM Compressor v0.10.0.2
Quantization Method: 8-bit Weight, 8-bit Dynamic Activation Quantization (W8A8)
Compatible Stack:
- ZenDNN v6.0.0
- PyTorch v2.11
- ZenTorch v2.11.0.1
- vLLM v0.22.0

This is a quantized version of gpt-oss-20b-BF16 created by AMD using LLM Compressor (compressed-tensors) for ZenDNN-optimized CPU inference.

Quantization

The model was quantized from gpt-oss-20b-BF16 with LLM Compressor using the Round-to-Nearest (RTN) algorithm, which needs no calibration data. Quantization reduces the model weights from 40 GiB to 21 GiB on disk (a reduction of about 47%).
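
To make the W8A8 idea concrete, here is a minimal pure-Python sketch of symmetric round-to-nearest quantization applied to one weight row (once, offline) and one activation vector (per token, at run time). It is an illustration only, not LLM Compressor's implementation. The helper names `quantize_rtn` and `w8a8_dot`, the per-vector scale, and the [-127, 127] range are assumptions chosen for the example.

```python
from typing import List, Tuple

INT8_MAX = 127  # assumption: symmetric range [-127, 127]


def quantize_rtn(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one vector to int8.

    Returns the integer codes and the scale such that value ~= code * scale.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / INT8_MAX if max_abs > 0 else 1.0
    codes = [max(-INT8_MAX, min(INT8_MAX, round(v / scale))) for v in values]
    return codes, scale


def w8a8_dot(weight_row: List[float], activation: List[float]) -> float:
    """Approximate dot(weight_row, activation) with int8 operands."""
    q_w, s_w = quantize_rtn(weight_row)  # weights: quantized once, offline
    q_x, s_x = quantize_rtn(activation)  # activations: quantized per call ("dynamic")
    acc = sum(w * x for w, x in zip(q_w, q_x))  # integer accumulation
    return acc * s_w * s_x  # rescale back to floating point


weights = [0.50, -1.27, 0.03, 0.90]
token = [1.0, 0.5, -2.0, 0.25]

exact = sum(w * x for w, x in zip(weights, token))
approx = w8a8_dot(weights, token)
print(f"exact={exact:.4f}  w8a8={approx:.4f}")
```

The activation scale is recomputed from each incoming vector, which is why no calibration dataset is needed in this scheme. The script that produced this checkpoint follows.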

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modeling.gpt_oss import convert_model_for_quantization_gptoss

model_id = "unsloth/gpt-oss-20b-BF16"
output_dir = "./gpt-oss-20b-BF16-w8a8"

# Step 1: Load the BF16 model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Step 2: Expand fused gpt-oss MoE experts into individual nn.Linear modules
# so QuantizationModifier quantizes every expert.
convert_model_for_quantization_gptoss(model)

# Step 3: Define the W8A8 recipe
recipe = QuantizationModifier(
    scheme="W8A8",
    targets=["Linear"],
    # Leave the output head and the MoE router/gate layers unquantized.
    ignore=[
        "lm_head",
        r"re:.*\.router$",
        r"re:.*\.router\..*",
        r"re:.*\.gate$",
        r"re:.*\.mlp\.gate$",
    ],
)

# Step 4: One-shot quantize and save in compressed-tensors format
oneshot(
    model=model,
    recipe=recipe,
    tokenizer=tokenizer,
    output_dir=output_dir,
    trust_remote_code_model=True,
)

# Smoke test
inputs = tokenizer("What are we having for dinner?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
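
The `ignore` list in the recipe mixes a plain name (`lm_head`) with `re:`-prefixed regular expressions. The sketch below shows how such patterns would select modules if each `re:` entry is matched against the full module name with Python's `re.match`. The helper `is_ignored` and the module names are hypothetical, and the matching rules inside compressed-tensors may differ in detail, so treat this as a way to reason about the patterns rather than as the library's behavior.

```python
import re
from typing import Iterable

IGNORE = [
    "lm_head",
    r"re:.*\.router$",
    r"re:.*\.router\..*",
    r"re:.*\.gate$",
    r"re:.*\.mlp\.gate$",
]


def is_ignored(module_name: str, patterns: Iterable[str] = IGNORE) -> bool:
    """Return True if module_name matches any ignore entry (assumed semantics)."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Regex entry: match from the start of the module name.
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            # Plain entry: exact name match.
            return True
    return False


# Hypothetical module names, for illustration only.
names = [
    "lm_head",
    "model.layers.0.mlp.router",
    "model.layers.0.mlp.router.linear",
    "model.layers.0.mlp.gate",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.experts.3.gate_proj",
]

for name in names:
    status = "ignored" if is_ignored(name) else "quantized"
    print(f"{name:45s} {status}")
```

Under these assumed semantics, the `$` anchor keeps `gate_proj` from being caught by the `.gate` pattern, and `re:.*\.mlp\.gate$` is a special case of `re:.*\.gate$`.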

Quick Start

Use with vLLM

from vllm import LLM, SamplingParams

model = LLM(
    model="amd/gpt-oss-20b-BF16-w8a8-llmcompressor-v0.10.0.2",
    dtype="bfloat16",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = model.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)

Requirements

torch==2.11
zentorch==2.11.0.1
vllm==0.22.0
llmcompressor==0.10.0.2
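
Because the model is tied to these versions, it can help to check the environment before loading it. The following sketch compares installed package versions with the pins above using the standard-library `importlib.metadata`. The helpers `matches_pin` and `check_pins` are hypothetical, and the comparison is a deliberately loose prefix check (it ignores local suffixes such as `+cpu`), not a full implementation of pip's `==` rules.

```python
from importlib import metadata
from typing import Callable, Dict, Optional

PINNED = {
    "torch": "2.11",
    "zentorch": "2.11.0.1",
    "vllm": "0.22.0",
    "llmcompressor": "0.10.0.2",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(found: str, wanted: str) -> bool:
    """Loose check: drop any local suffix, then compare by prefix."""
    public = found.split("+", 1)[0]
    return public == wanted or public.startswith(wanted + ".")


def check_pins(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, str]:
    """Return a {package: problem} mapping; empty means every pin is satisfied."""
    problems = {}
    for package, wanted in pins.items():
        found = lookup(package)
        if found is None:
            problems[package] = "not installed"
        elif not matches_pin(found, wanted):
            problems[package] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    for package, problem in check_pins(PINNED).items():
        print(f"{package}: {problem}")
```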

OpenMP Setup

For optimal performance, point LD_PRELOAD at libomp.so (LLVM OpenMP) or libiomp5.so (Intel OpenMP):

# Using LLVM OpenMP (llvmopenmp)
export LD_PRELOAD="$(find /path/to/env -name "libomp.so" | head -n 1)"

# Or using Intel OpenMP (libiomp)
export LD_PRELOAD="$(find /path/to/env -name "libiomp5.so" | head -n 1)"

Note: Set LD_PRELOAD before launching vLLM or any inference script.
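
The dynamic linker reads LD_PRELOAD when a process starts, so changing `os.environ` inside an already running Python interpreter does not preload anything into that interpreter. When inference is launched from Python, one option is to build the environment first and pass it to a child process. The sketch below does that. The helpers `find_openmp` and `env_with_preload` are hypothetical, and unlike the `find ... | head` commands above it picks the first match in sorted order.

```python
import os
from pathlib import Path
from typing import Dict, Optional, Sequence


def find_openmp(
    env_root: str,
    names: Sequence[str] = ("libomp.so", "libiomp5.so"),
) -> Optional[Path]:
    """Search env_root recursively and return the first OpenMP library found."""
    root = Path(env_root)
    for name in names:
        candidates = sorted(root.rglob(name))
        if candidates:
            return candidates[0]
    return None


def env_with_preload(
    library: Path,
    base: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Copy an environment and put library at the front of LD_PRELOAD."""
    env = dict(os.environ if base is None else base)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else str(library)
    return env
```

A child process would then be started with something like `subprocess.run([sys.executable, "infer.py"], env=env_with_preload(lib))`, where `infer.py` stands in for your own inference script.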

Evaluation

The model was evaluated against the BF16 (unquantized) baseline on standard benchmarks using lm-evaluation-harness with the vLLM engine.

Benchmark	BF16 Baseline	W8A8 (this model)	Recovery
GSM8K (5-shot)	0.8825	0.8832	100.08%
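
The Recovery column is the quantized score expressed as a percentage of the baseline score. A short sketch of that arithmetic, using the GSM8K numbers from the table (the function name `recovery_percent` is made up for this example):

```python
def recovery_percent(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * quantized / baseline


gsm8k = recovery_percent(baseline=0.8825, quantized=0.8832)
print(f"GSM8K recovery: {gsm8k:.2f}%")
```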

Evaluation Command

lm_eval \
    --model vllm \
    --model_args pretrained=amd/gpt-oss-20b-BF16-w8a8-llmcompressor-v0.10.0.2,dtype=bfloat16,max_model_len=4096 \
    --tasks gsm8k \
    --batch_size auto \
    --trust_remote_code \
    --num_fewshot 5 \
    --apply_chat_template \
    --log_samples \
    --gen_kwargs "max_gen_toks=2048" \
    --output_path .

Limitations

- Version Lock: This model is compatible with ZenTorch v2.11.0.1 / PyTorch v2.11. It may not load correctly on other versions.
- CPU Only: This model is optimized for AMD EPYC CPU inference via ZenDNN. It is not intended for GPU inference.

License

This model is distributed under the same license as the source model. See the LICENSE file for details.

URL: https://huggingface.co/amd/gpt-oss-20b-BF16-w8a8-llmcompressor-v0.10.0.2
