URL: https://huggingface.co/amd/gpt-oss-20b-BF16-w8a8-llmcompressor-v0.10.0.2

gpt-oss-20b-BF16-w8a8-llmcompressor-v0.10.0.2

Model Overview

  • Model Architecture: GptOssForCausalLM
    • Input: Text
    • Output: Text
  • Source Model: gpt-oss-20b-BF16
  • Supported Hardware: AMD EPYC (CPU inference)
  • Preferred Operating System: Linux
  • Inference Engine: vLLM v0.22.0
  • Quantization Framework: LLM Compressor v0.10.0.2
  • Quantization Method: 8-bit Weight, 8-bit Dynamic Activation Quantization (W8A8)
  • Compatible Stack:
    • ZenDNN v6.0.0
    • PyTorch v2.11
    • ZenTorch v2.11.0.1
    • vLLM v0.22.0

This is a quantized version of gpt-oss-20b-BF16 created by AMD using LLM Compressor (compressed-tensors) for ZenDNN-optimized CPU inference.
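
To make the W8A8 entry in the overview concrete, here is a minimal pure-Python sketch of 8-bit weight and 8-bit dynamic activation quantization for a single dot product. It assumes symmetric scales, one static scale per weight row, and one scale per activation vector computed at run time; those choices are my reading of a typical W8A8 setup, not something this card states, and the helper names `quantize_int8` and `w8a8_dot` are hypothetical. It is an illustration only, not the LLM Compressor or ZenDNN implementation.

```python
def quantize_int8(values):
    """Symmetric int8 quantization of one vector by rounding to the nearest level."""
    qmax = 127  # largest magnitude used for signed 8-bit values
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # round() picks the nearest integer; clamp to guard against float drift.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def w8a8_dot(weight_row, activation):
    """Dot product with int8 weights and int8 activations, rescaled to float."""
    q_w, s_w = quantize_int8(weight_row)   # would be computed once, offline
    q_x, s_x = quantize_int8(activation)   # "dynamic": computed per input
    acc = sum(w * x for w, x in zip(q_w, q_x))  # integer accumulation
    return acc * s_w * s_x


weight_row = [0.50, -1.27, 0.03, 0.90]   # made-up values
activation = [2.0, -0.5, 1.0, 0.25]      # made-up values

exact = sum(w * x for w, x in zip(weight_row, activation))
approx = w8a8_dot(weight_row, activation)
print(f"exact={exact:.4f} approx={approx:.4f} error={abs(exact - approx):.4f}")
```

The weights are quantized once ahead of time, while the activation scale is recomputed for every input, which is why no activation statistics need to be stored with the checkpoint.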

Quantization

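The model was quantized from gpt-oss-20b-BF16 with LLM Compressor using the Round-to-Nearest (RTN) algorithm, which requires no calibration data. Quantization reduces the model weights on disk from 40 GiB to 21 GiB (a reduction of about 47%). The following script performs the quantization: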

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modeling.gpt_oss import convert_model_for_quantization_gptoss

model_id = "unsloth/gpt-oss-20b-BF16"
output_dir = "./gpt-oss-20b-BF16-w8a8"

# Step 1: Load the BF16 model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Step 2: Expand fused gpt-oss MoE experts into individual nn.Linear modules
# so QuantizationModifier quantizes every expert.
convert_model_for_quantization_gptoss(model)

# Step 3: Define the W8A8 recipe
recipe = QuantizationModifier(
    scheme="W8A8",
    targets=["Linear"],
    # Leave the output head and the MoE router/gate layers unquantized.
    ignore=[
        "lm_head",
        r"re:.*\.router$",
        r"re:.*\.router\..*",
        r"re:.*\.gate$",
        r"re:.*\.mlp\.gate$",
    ],
)

# Step 4: One-shot quantize and save in compressed-tensors format
oneshot(
    model=model,
    recipe=recipe,
    tokenizer=tokenizer,
    output_dir=output_dir,
    trust_remote_code_model=True,
)

# Smoke test
inputs = tokenizer("What are we having for dinner?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
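
The `ignore` list in the recipe above mixes a plain name with `re:`-prefixed patterns. As a rough sketch of how such a list selects modules, the snippet below assumes that a `re:` entry is a regular expression matched from the start of the dotted module name and that any other entry must equal the name exactly. The real matching logic in LLM Compressor may differ (for example, it may also match class names), and the module names used here are hypothetical.

```python
import re

IGNORE = [
    "lm_head",
    r"re:.*\.router$",
    r"re:.*\.router\..*",
    r"re:.*\.gate$",
    r"re:.*\.mlp\.gate$",
]


def is_ignored(name, patterns=IGNORE):
    """Return True if a module name is excluded under the assumed matching rules."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Assumption: the text after "re:" is a regex applied to the full name.
            if re.match(pattern[3:], name):
                return True
        elif name == pattern:
            return True
    return False


# Hypothetical module names, for illustration only.
for name in [
    "lm_head",
    "model.layers.0.mlp.router",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.experts.3.gate_proj",
]:
    print(f"{name}: {'ignored' if is_ignored(name) else 'quantized'}")
```

Under these assumptions, names ending in `gate_proj` are still quantized because the gate patterns are anchored with `$`, and the last pattern is already covered by `re:.*\.gate$`.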

Quick Start

Use with vLLM

from vllm import LLM, SamplingParams

model = LLM(
    model="amd/gpt-oss-20b-BF16-w8a8-llmcompressor-v0.10.0.2",
    dtype="bfloat16",
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = model.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)

Requirements

torch==2.11
zentorch==2.11.0.1
vllm==0.22.0
llmcompressor==0.10.0.2
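
Because the card pins exact versions, it can help to check an environment before loading the model. The following is a minimal sketch using only the standard library; the helper names `matches_pin` and `check_pins` are hypothetical, and the matching rule (ignore a local suffix such as `+cpu`, and accept `2.11.x` for a pin of `2.11`) is my own assumption and is more permissive than pip's `==`.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the Requirements list above.
PINS = {
    "torch": "2.11",
    "zentorch": "2.11.0.1",
    "vllm": "0.22.0",
    "llmcompressor": "0.10.0.2",
}


def matches_pin(installed, pinned):
    """Compare versions, ignoring any local suffix such as '+cpu'."""
    public = installed.split("+", 1)[0]
    return public == pinned or public.startswith(pinned + ".")


def check_pins(pins=PINS, lookup=version):
    """Return a {package: status} report for the given pins."""
    report = {}
    for package, pinned in pins.items():
        try:
            installed = lookup(package)
        except PackageNotFoundError:
            report[package] = "missing"
            continue
        if matches_pin(installed, pinned):
            report[package] = "ok"
        else:
            report[package] = f"mismatch (installed {installed}, expected {pinned})"
    return report


if __name__ == "__main__":
    for package, status in check_pins().items():
        print(f"{package}: {status}")
```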

OpenMP Setup

For best performance, preload an OpenMP runtime through LD_PRELOAD: either libomp.so (LLVM OpenMP) or libiomp5.so (Intel OpenMP).

# Using LLVM OpenMP (llvmopenmp)
export LD_PRELOAD=$(find /path/to/env -name "libomp.so" | head -1)

# Or using Intel OpenMP (libiomp)
export LD_PRELOAD=$(find /path/to/env -name "libiomp5.so" | head -1)

Note: Set LD_PRELOAD before launching vLLM or any inference script.
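
When inference is launched from Python rather than from a shell, the same idea can be sketched as follows: locate the library under an environment prefix and build an environment mapping with LD_PRELOAD set for a child process. The helper names `find_openmp` and `env_with_preload` are hypothetical, and the search order (libomp.so first, then libiomp5.so) is an assumption rather than a recommendation from this card.

```python
import os
from pathlib import Path


def find_openmp(env_root, names=("libomp.so", "libiomp5.so")):
    """Return the first matching OpenMP library under env_root, or None."""
    root = Path(env_root)
    for name in names:
        matches = sorted(root.rglob(name))
        if matches:
            return str(matches[0])
    return None


def env_with_preload(lib_path, base_env=None):
    """Copy an environment mapping and put lib_path at the front of LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


# Example usage (paths and script name are placeholders):
#   import subprocess, sys
#   lib = find_openmp("/path/to/env")
#   if lib is not None:
#       subprocess.run([sys.executable, "run_inference.py"],
#                      env=env_with_preload(lib), check=True)
```

The variable has to be in place when the child process starts, which is why the sketch passes a prepared environment to the subprocess instead of modifying `os.environ` after the interpreter is already running.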

Evaluation

The model was evaluated against the unquantized BF16 baseline on GSM8K (5-shot) using lm-evaluation-harness with the vLLM engine.

| Benchmark | BF16 Baseline | W8A8 (this model) | Recovery |
|---|---|---|---|
| GSM8K (5-shot) | 0.8825 | 0.8832 | 100.08% |
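
The Recovery column is consistent with the quantized score divided by the baseline score, expressed as a percentage. The short check below reproduces the figure from the two scores in the table; the function name `recovery_percent` is hypothetical.

```python
def recovery_percent(quantized_score, baseline_score):
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized_score / baseline_score


bf16_score = 0.8825   # from the table above
w8a8_score = 0.8832   # from the table above

print(f"{recovery_percent(w8a8_score, bf16_score):.2f}%")  # 100.08%
```

The absolute gap between the two scores is 0.0007, so a recovery slightly above 100% is best read as parity with the baseline rather than as an improvement.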

Evaluation Command

lm_eval \
    --model vllm \
    --model_args pretrained=amd/gpt-oss-20b-BF16-w8a8-llmcompressor-v0.10.0.2,dtype=bfloat16,max_model_len=4096 \
    --tasks gsm8k \
    --batch_size auto \
    --trust_remote_code \
    --num_fewshot 5 \
    --apply_chat_template \
    --log_samples \
    --gen_kwargs "max_gen_toks=2048" \
    --output_path .

Limitations

  • Version Lock: This model is compatible with ZenTorch v2.11.0.1 / PyTorch v2.11. It may not load correctly on other versions.
  • CPU Only: This model is optimized for AMD EPYC CPU inference via ZenDNN. It is not intended for GPU inference.

License

This model is distributed under the same license as the source model. See the LICENSE file for details.

Modifications copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved.

Safetensors model size: 21B params. Tensor types: BF16, I8.
