RedHatAI/Qwen3-8B-speculator.peagle

This is a DFlash speculator model for Qwen/Qwen3-8B.

Training Details

This model was trained using the Speculators library on a subset of Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered and the train_sft split of HuggingFaceH4/ultrachat_200k. Responses were regenerated by Qwen3-8B (with reasoning).


Base Model	Qwen/Qwen3-8B
Chat Template	Qwen/Qwen3-8B (use `/chat/completions` endpoint)
Format	Safetensors
License	Apache 2.0
Validation Hardware	Nvidia H100

# Install vLLM 
 
# Deploy with speculative decoding 
vllm serve RedHatAI/Qwen3-8B-speculator.peagle

Per-position token acceptance rates across datasets:
(with reasoning enabled)

Dataset	Pos 1	Pos 2	Pos 3	Pos 4	Pos 5	Pos 6	Pos 7	Avg Length
HumanEval	81.3%	59.0%	41.1%	27.9%	18.8%	12.8%	8.9%	3.500
math_reasoning	83.3%	63.5%	47.0%	34.3%	24.4%	17.2%	11.8%	3.820
qa	70.5%	44.7%	27.6%	17.1%	10.8%	7.1%	4.8%	2.830
question	74.6%	49.6%	31.6%	20.2%	13.1%	8.5%	5.6%	3.030
rag	73.6%	48.4%	29.8%	18.4%	11.3%	6.9%	4.1%	2.930
summarization	68.0%	39.0%	21.0%	10.8%	5.4%	2.6%	1.2%	2.480
tool_call	73.7%	47.6%	28.7%	17.1%	10.3%	6.2%	3.7%	2.870
translation	73.8%	47.7%	28.7%	17.3%	10.4%	6.5%	4.1%	2.890
writing	75.0%	50.0%	32.1%	20.6%	13.3%	8.7%	5.7%	3.050