gpt-oss-120b-Fable-5-Distilled

A LoRA fine-tuned MoE coding agent model distilled from real-world Claude Code programming sessions. Merged into a single complete model — download and run directly on Apple Silicon via MLX.

Recommended temperature: 0.95. This model was trained on opencode agent traces in which the assistant alternates between reasoning, tool calls, and code generation; a slightly higher temperature helps preserve that alternating behavior.

Model Summary

gpt-oss-120b-Fable-5-Distilled is a fully merged LoRA fine-tune of gpt-oss-120b-heretic-v2-mxfp4-q8-hi-mlx, trained using the Muon optimizer with Newton-Schulz orthogonalization on Apple MLX. The fine-tuning data consists of 352 real-world opencode programming sessions extracted from armand0e/claude-fable-5-claude-code.

This is a complete model upload — no separate adapter files or base model are needed.
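To make "fully merged" concrete, the sketch below folds a LoRA adapter into a base weight so that no adapter is needed at inference. It uses the r=16 and α=32 values listed under Training Details; the layer shape, the random weights, and the helper name `merge_lora` are illustrative assumptions, and the dequantize/requantize step that merging into quantized weights would need is not shown.

```python
import numpy as np

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a LoRA update into the base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    return weight + scale * (lora_b @ lora_a)

rng = np.random.default_rng(0)
d_out, d_in = 64, 48          # hypothetical layer shape, for illustration only
rank, alpha = 16, 32          # values from the Training Details table

weight = rng.standard_normal((d_out, d_in))
lora_a = rng.standard_normal((rank, d_in)) * 0.01    # "A" matrix: rank x d_in
lora_b = rng.standard_normal((d_out, rank)) * 0.01   # "B" matrix: d_out x rank

merged = merge_lora(weight, lora_a, lora_b, alpha, rank)

# The merged layer should reproduce base + adapter on any input.
x = rng.standard_normal(d_in)
adapter_out = weight @ x + (alpha / rank) * (lora_b @ (lora_a @ x))
merged_out = merged @ x
max_diff = float(np.max(np.abs(adapter_out - merged_out)))
```

After merging, only `merged` has to be shipped, which is why a merged upload needs no adapter files.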

The model learns to act as a coding agent: understanding user requests, exploring repositories, making code edits via tool calls, and delivering final solutions — all through the Harmony channel protocol (analysis for internal reasoning, final for the delivered response).

Highlights

🔧 Agent-style coding — trained on real tool_use traces (Bash, Read, Edit, TaskCreate, Git) from Claude Code sessions. Generates natural tool-call sequences.
🧠 Muon optimizer — Newton-Schulz orthogonalized updates on LoRA matrices for stronger convergence at low training step counts.
🎯 13/13 PASS — perfect score on a 13-task custom benchmark spanning math olympiad, algorithm coding, DS/algo, logic puzzles, and scientific computing.
🌡️ Recommended temperature: 0.95 — preserves the agent's multi-step reasoning + tool-call rhythm.
🍎 Fully on-device — no cloud API, no GPU cluster. Pure MLX inference on Mac.
📦 Full model — merged weights included. No adapter setup required.
📐 Harmony channel format — analysis (chain-of-thought) and final (response) channel separation.

Tool Call Details

See TOOL_CALLING.md for the full tool calling guide.

Quick Start

macOS + MLX

# Using mlx_lm (a separate package from mlx; install with: pip install mlx-lm)
mlx_lm.chat --model gpt-oss-120b-Fable-5-Distilled --max-tokens 10000 --temp 0.95

Python

from mlx_tune import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
 model_name="gpt-oss-120b-Fable-5-Distilled",
 max_seq_length=4096,
 load_in_4bit=True,
)

prompt = """<|start|>system<|message|>You are a helpful coding assistant.
Reasoning: high
# Valid channels: analysis, commentary, final<|end|>
<|start|>user<|message|>Write a function to find the longest common substring of two strings.<|end|>
<|start|>assistant<|channel|>analysis<|message|>"""

outputs = model.generate(prompt, max_new_tokens=2048, temperature=0.95, top_p=0.9)
print(outputs)

Prompt Format (Harmony)

This model uses the gpt-oss Harmony channel protocol:

<|start|>system<|message|>{system prompt}
Reasoning: high
# Valid channels: analysis, commentary, final<|end|>
<|start|>user<|message|>{your request — including coding tasks, tool-call requests, etc.}<|end|>
<|start|>assistant<|channel|>analysis<|message|>{internal chain-of-thought}<|end|>
<|start|>assistant<|channel|>final<|message|>{final answer / tool call sequence}<|return|>

The analysis channel contains the model's internal reasoning. Display or hide it depending on the use case.
The final channel contains the deliverable response (code, tool calls, or natural language).
Always prompt the model to begin with <|start|>assistant<|channel|>analysis<|message|>.
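The template above can be assembled and parsed with a few lines of string handling. This is a minimal sketch: `build_harmony_prompt` and `split_channels` are hypothetical helper names, not part of mlx_lm or mlx-tune, and the parser assumes the raw completion still contains the special tokens as plain text.

```python
import re

def build_harmony_prompt(system_prompt, user_message, reasoning="high"):
    """Assemble a prompt that ends at the start of the analysis channel."""
    return (
        f"<|start|>system<|message|>{system_prompt}\n"
        f"Reasoning: {reasoning}\n"
        "# Valid channels: analysis, commentary, final<|end|>\n"
        f"<|start|>user<|message|>{user_message}<|end|>\n"
        "<|start|>assistant<|channel|>analysis<|message|>"
    )

# Capture "<|channel|>NAME<|message|>BODY", where BODY runs up to the next
# control token or the end of the text.
_CHANNEL_RE = re.compile(
    r"<\|channel\|>(\w+)<\|message\|>(.*?)"
    r"(?=<\|end\|>|<\|return\|>|<\|start\|>|$)",
    re.DOTALL,
)

def split_channels(completion):
    """Split a raw completion into {channel name: text}.

    The completion is assumed to begin inside the analysis channel,
    because the prompt already opened it.
    """
    text = "<|channel|>analysis<|message|>" + completion
    channels = {}
    for name, body in _CHANNEL_RE.findall(text):
        channels.setdefault(name, []).append(body.strip())
    return {name: "\n".join(parts) for name, parts in channels.items()}

prompt = build_harmony_prompt(
    "You are a helpful coding assistant.",
    "Write a function to reverse a string.",
)
example_completion = (
    "The user wants a one-liner.<|end|>\n"
    "<|start|>assistant<|channel|>final<|message|>"
    "def rev(s): return s[::-1]<|return|>"
)
parsed = split_channels(example_completion)
```

Keeping the channels in a dictionary makes it easy to show only `parsed["final"]` to the user while logging `parsed["analysis"]` separately.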

Training Details

Parameter	Value
Base model	`gpt-oss-120b-heretic-v2-mxfp4-q8-hi-mlx`
Architecture	MoE (Mixture of Experts), 120B params, top-k=4, 128 experts
Quantization	mxfp4 weights + q8 hi-precision activations
Framework	Apple MLX + mlx-tune
Optimizer	Muon — Newton-Schulz orthogonalized momentum (ndim >= 2) + AdamW fallback (1-D params)
LoRA rank / alpha	r=16, α=32
LoRA dropout	0.0
LoRA target modules	q/k/v/o/gate/up/down projections + router (SwitchLinear expert layers)
Router treatment	LoRA on router (top-k=4 — sufficient competition, no special unfreezing needed)
Muon learning rate	1e-5
AdamW learning rate	1e-5
Muon momentum	0.95
Newton-Schulz steps	5
Effective batch size	2 × grad accum 8 = 16
Training steps	40
Warmup ratio	0.05
LR scheduler	Cosine decay
Max sequence length	4096 tokens
Peak memory during training	~110 GB unified memory
Training hardware	Apple Silicon M-series (>= 64 GB recommended)
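As a rough illustration of what the schedule settings in the table imply, the sketch below computes a learning rate per step for 40 steps with a 0.05 warmup ratio and cosine decay from 1e-5. The linear warmup shape, the decay to zero, and the function name `lr_at_step` are assumptions; the card does not say how mlx-tune implements the schedule.

```python
import math

def lr_at_step(step, total_steps=40, base_lr=1e-5, warmup_ratio=0.05):
    """Linear warmup followed by cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))  # 2 steps here
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at_step(step) for step in range(40)]

# Batch size 2 with 8 gradient-accumulation steps, as listed in the table.
effective_batch = 2 * 8
sequences_seen = effective_batch * 40
```

Under these assumptions the run covers 640 sequences in total, roughly two passes over the 317 training samples if every step consumes a full effective batch.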

Dataset

Source: armand0e/claude-fable-5-claude-code
Type: Opencode session logs — real-world Claude Code programming agent traces
Extracted samples: 352 user → assistant turns (from 55 session files, ~100 sessions)
Content: tool_use sequences (Bash, Read, Edit, TaskCreate, Git), code generation, multi-step agent workflows
Format: Converted from opencode event format → Harmony channel protocol
Split: 90% train (317) / 10% validation (35), seed=42
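A seeded 90/10 split that reproduces the 317/35 counts could look like the sketch below. The shuffle-then-slice approach, the truncating rounding rule, and the name `split_samples` are assumptions; the card only states the ratio, the counts, and seed=42, so the actual sample assignment may differ.

```python
import random

def split_samples(samples, val_fraction=0.10, seed=42):
    """Shuffle with a fixed seed, then slice off the validation share."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)   # 352 samples -> 35 validation
    return shuffled[n_val:], shuffled[:n_val]

# Placeholder records standing in for the 352 extracted turns.
samples = [{"id": i} for i in range(352)]
train, val = split_samples(samples)
```

Fixing the seed keeps the validation set stable across runs, so validation loss stays comparable between experiments.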

Muon Optimizer

Muon provides faster convergence than AdamW for matrix-shaped parameters by orthogonalizing gradient updates via Newton-Schulz iteration:

Matrix params (ndim >= 2): LoRA A/B matrices → Muon (Nesterov momentum + NS orthogonalization + shape-aware scaling)
1-D params / bias: → AdamW fallback

Unlike our previous BAdam-based models, this release uses uniform Muon training across all LoRA layers, including the router. At top-k=4, the router competition space is sufficient to avoid expert collapse without special learning-rate treatment.
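The Newton-Schulz step can be sketched in NumPy as follows. The quintic coefficients are taken from the public Muon reference implementation and are an assumption here, since this card does not state which coefficients were used; momentum and shape-aware scaling are omitted, and `newton_schulz_orthogonalize` is an illustrative name.

```python
import numpy as np

def newton_schulz_orthogonalize(grad, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2-D update via quintic Newton-Schulz."""
    a, b, c = 3.4445, -4.7750, 2.0315       # assumed: reference Muon values
    x = grad / (np.linalg.norm(grad) + eps)  # Frobenius norm, so sigma <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                           # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        gram = x @ x.T
        poly = b * gram + c * (gram @ gram)
        x = a * x + poly @ x
    return x.T if transposed else x

rng = np.random.default_rng(0)
grad = rng.standard_normal((16, 64))         # e.g. a rank-16 LoRA matrix
update = newton_schulz_orthogonalize(grad)
singular_values = np.linalg.svd(update, compute_uv=False)
```

With these coefficients the iteration does not converge to exactly orthogonal output; it pushes every singular value into a band around 1, which is enough to equalize the update's scale across directions.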

Benchmark Results

Custom hard benchmark — 13 tasks across math, coding, logic, and science. Auto-graded with keyword matching + live code execution (60s timeout per task, max 2 retries on error).

ID	Task	Verdict	KW%	Time	Tok/s
math_02	Euler Totient Sum — last 6 digits	✅ PASS	100%	41.9s	13
math_03	Lattice Paths Avoiding the Anti-Diagonal	✅ PASS	100%	98.2s	16
math_04	Segmented Sieve in [10¹², 10¹²+10⁶]	✅ PASS	100%	71.8s	15
code_01	Median of Two Sorted Arrays	✅ PASS	100%	62.7s	14
code_02	Thread-Safe LRU Cache with TTL	✅ PASS	100%	114.7s	14
code_03	Persistent Segment Tree — K-th Query	✅ PASS	100%	194.7s	16
code_04	Multi-Head Attention + RoPE (NumPy)	✅ PASS	100%	114.3s	13
code_05	Dijkstra vs A* on Geometric Graph	✅ PASS	100%	119.1s	13
logic_01	Knights & Knaves — Exhaustive SAT	✅ PASS	100%	41.7s	15
logic_02	Verify Three Mathematical Claims	✅ PASS	100%	47.2s	14
sci_01	Figure-8 Three-Body Orbit — Energy Conservation	✅ PASS	100%	101.9s	14
sci_02	Metropolis-Hastings vs HMC Comparison	✅ PASS	100%	110.5s	13
sci_03	Optimizer Comparison on Rosenbrock	✅ PASS	100%	70.4s	12

Overall: 13/13 PASS | Avg KW: 100.0% | Total benchmark time: 1189s

All 13 tasks pass on the first attempt, with zero retries required. On this benchmark, the model generated correct, executable code with full keyword coverage for every task.
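The grading loop described above (keyword matching plus live code execution with a 60s timeout and up to 2 retries) might be structured like this. It is a sketch under assumptions: the benchmark harness is not published in this card, so `keyword_coverage`, `run_code`, `grade`, and `fake_generate` are hypothetical names, and the way keywords and execution combine into a verdict is a guess.

```python
import subprocess
import sys

def keyword_coverage(text, keywords):
    """Percentage of expected keywords found in the text (case-insensitive)."""
    if not keywords:
        return 100.0
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return 100.0 * hits / len(keywords)

def run_code(code, timeout_s=60):
    """Run generated code in a fresh interpreter; return (ok, output)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    return proc.returncode == 0, proc.stdout + proc.stderr

def grade(generate, keywords, max_retries=2, timeout_s=60):
    """Call generate(attempt) until the code runs or retries are exhausted."""
    for attempt in range(max_retries + 1):
        response, code = generate(attempt)
        ok, output = run_code(code, timeout_s)
        if ok:
            kw_pct = keyword_coverage(response + "\n" + output, keywords)
            return {"verdict": "PASS", "kw_pct": kw_pct, "retries": attempt}
    return {"verdict": "FAIL", "kw_pct": 0.0, "retries": max_retries}

def fake_generate(attempt):
    """Stand-in for the model: fails once, then returns working code."""
    code = "raise SystemExit(1)" if attempt == 0 else "print(sum(range(10)))"
    return "Computes the sum with range.", code

result = grade(fake_generate, keywords=["sum", "45"])
```

Running each attempt in a separate interpreter process keeps a crash or infinite loop in generated code from taking down the grader.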

Key Benchmark Details

code_01 Median — O(log n) binary search vs merge: 5656× speedup, correct.
code_02 LRU Cache — 8-thread stress test: 710k ops/sec, hit_rate 49.8%, 200k evictions.
code_03 Persistent Segment Tree — Q1=3, Q2=2, Q3=31, Q4=1 all correct.
code_04 Multi-Head Attention + RoPE — causal mask verified, max_future_attn = 0.0.
sci_01 Three-body — energy drift only 3.0e-14, numerical precision excellent.
sci_02 HMC — ESS 379k (MH: 1,930), gradient direction correct (p += grad_log_target).

Inference Speed

Metric	Value
Throughput on Apple Silicon M-series (MLX, 4-bit)	12–16 tok/s
Peak memory (inference)	~69 GB

This is a 120B MoE model, quantized to 4-bit, running entirely on-device on a Mac with no cloud dependency.

Limitations

Apple Silicon only — quantized and optimized for MLX. Not compatible with CUDA/ROCm without re-quantization.
40-step checkpoint — an early fine-tune designed for quick validation. Further training would deepen the coding agent behavior.
No RLHF — trained on supervised opencode session data only. Safety alignment is lighter than commercial instruction-tuned models.
Harmony prompt format required — this model expects the <|channel|> protocol. Standard ChatML or Alpaca-style prompts will produce degraded results.
Small dataset — 352 turns is modest. The benchmark results suggest high-quality, high-signal data, but broader generalization benefits from more training data.

Citation / Acknowledgements

Base model: gpt-oss-120b-heretic-v2
Training dataset: armand0e/claude-fable-5-claude-code
Optimizer: Muon ("Muon: An optimizer for hidden layers in neural networks")
Framework: Apple MLX · mlx-tune
Agent traces from: Claude Code (Anthropic) via opencode session logs

URL: https://huggingface.co/cloudyu/gpt-oss-120b-Fable-5-Distilled
