openNemo-9B-Claude-Opus-4.6-distill

Reasoning-distilled version of openNemo-9B, fine-tuned on Claude Opus 4.6 reasoning traces.

Trained with SFT + DPO on community-curated reasoning distillation datasets to produce step-by-step <think> chains before answering. Built on openNemo's pure-PyTorch implementation of the Nemotron-H architecture, so neither mamba-ssm nor causal-conv1d is required.

By Empero AI

What is this?

A 9B dense hybrid model (Mamba2 + Transformer) that has been taught to reason through problems before answering, using reasoning traces distilled from Claude Opus 4.6 and other frontier models.

Training uses a two-stage pipeline:

SFT: teaches the reasoning format (<think> tags, step-by-step chains, edge-case consideration)
DPO: teaches a preference for thorough reasoning over skipping the thinking step

Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization (fits in ~8 GB VRAM)
bnb_config = BitsAndBytesConfig(
 load_in_4bit=True,
 bnb_4bit_quant_type="nf4",
 bnb_4bit_compute_dtype=torch.bfloat16,
 bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
 "empero-ai/openNemo-9B-Claude-Opus-4.6-distill",
 quantization_config=bnb_config,
 trust_remote_code=True,
 device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("empero-ai/openNemo-9B-Claude-Opus-4.6-distill")

messages = [
 {"role": "system", "content": "You are a deep reasoning AI. When given a problem, you think through it carefully and methodically inside <think> tags before providing your final answer."},
 {"role": "user", "content": "Prove that the sum of the first n odd numbers equals n²."},
]

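# enable_thinking=True requests the <think> reasoning mode (use False for instruct mode, below)
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample the response; reasoning chains can be long, so leave a generous token budget
output = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=0.95)
# Decode only the newly generated tokens (skip the prompt)
response = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)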
print(response)
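The decoded response holds the reasoning chain followed by the final answer. The sketch below separates the two. It assumes the model closes its reasoning with a literal </think> tag and that the tag survives decoding; if <think> and </think> are registered as special tokens, skip_special_tokens=True may strip them, in which case decode with skip_special_tokens=False instead. The helper split_reasoning is hypothetical, not part of the model's code.

```python
def split_reasoning(response: str) -> tuple[str, str]:
    """Split a decoded response into (reasoning, answer).

    Hypothetical helper. Assumes the reasoning ends at the first </think>.
    The opening <think> may be absent if the chat template already placed
    it in the prompt, so it is removed only when present.
    """
    if "</think>" not in response:
        # No reasoning block found (e.g. instruct mode): all text is the answer
        return "", response.strip()
    head, _, answer = response.partition("</think>")
    # Keep only the text after an opening <think> tag, if there is one
    reasoning = head.split("<think>", 1)[-1]
    return reasoning.strip(), answer.strip()


reasoning, answer = split_reasoning("<think>\n1 + 3 = 4 = 2^2\n</think>\nThe sum is n^2.")
print(answer)  # The sum is n^2.
```

In the Quickstart above this would be called as split_reasoning(response), printing the answer and the reasoning separately.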

Without thinking (instruct mode)

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)

Architecture

9B dense hybrid Nemotron-H — same architecture as the base openNemo-9B:

Parameter	Value
Total parameters	~9B
Architecture	Hybrid Mamba2 + GQA Transformer + MLP
Layers	52 (Mamba2 SSM + GQA Attention + MLP)
Max context length	262,144 tokens
Vocabulary size	131,072

Training Details

Stage 1: Supervised Fine-Tuning (SFT)

Trained on 9 reasoning distillation datasets:

Dataset	Type	Approx. Size
nohurry/Opus-4.6-Reasoning-3000x-filtered	problem/thinking/solution	~3,000
Roman1111111/claude-opus-4.6-10000x	messages	~10,000
Crownelius/Opus-4.6-Reasoning-3300x	problem/thinking/solution	~3,300
TeichAI/claude-haiku-4.5-high-reasoning-1700x	messages	~1,700
TeichAI/Claude-Opus-4.6-Reasoning-927x	messages	~927
Jackrong/Qwen3.5-reasoning-700x	conversation	~700
dalisoft/claude-opus-4.6-high-reasoning-700x	messages	~700
TeichAI/claude-4.5-opus-high-reasoning-250x	messages	~250
Hastagaras/Claude-Sonnet-X-Opus-4.6-Reasoning-small-500	messages	~500
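The datasets above come in three layouts (problem/thinking/solution, messages, conversation), so they presumably need to be brought into one chat format before SFT. A minimal sketch of that normalization follows. It assumes the problem/thinking/solution columns are literally named problem, thinking and solution, that conversation rows follow the ShareGPT from/value convention, and that the reasoning is wrapped in <think> tags ahead of the answer. These are assumptions, and to_messages and sharegpt_to_messages are hypothetical helpers, not the authors' preprocessing code.

```python
from typing import Optional

# Hypothetical role mapping for ShareGPT-style "conversation" rows
SHAREGPT_ROLES = {"system": "system", "human": "user", "gpt": "assistant"}


def to_messages(row: dict, system_prompt: Optional[str] = None) -> list[dict]:
    """Convert a problem/thinking/solution row into chat messages.

    Assumes columns named "problem", "thinking" and "solution"; the reasoning
    is wrapped in <think> tags ahead of the final solution.
    """
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": row["problem"].strip()})
    assistant = f"<think>\n{row['thinking'].strip()}\n</think>\n\n{row['solution'].strip()}"
    messages.append({"role": "assistant", "content": assistant})
    return messages


def sharegpt_to_messages(turns: list[dict]) -> list[dict]:
    """Map ShareGPT turns ({"from": ..., "value": ...}) to role/content messages."""
    return [{"role": SHAREGPT_ROLES[t["from"]], "content": t["value"]} for t in turns]
```

Rows already in the messages layout can be used as-is.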

Stage 2: Direct Preference Optimization (DPO)

Preference pairs were constructed from the same datasets:

Chosen: Full response with <think> reasoning chain
Rejected: Same response with <think> block stripped

Additional DPO source: QuietImpostor/Sao10K-Claude-3-Opus-Instruct-15K-ShareGPT
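The chosen/rejected construction above can be sketched as follows. This assumes each assistant response contains a single <think>...</think> block and that pairs use the common prompt/chosen/rejected field layout; the card does not specify the exact schema, and make_dpo_pair is a hypothetical helper.

```python
import re
from typing import Optional

# First <think>...</think> block plus any whitespace that follows it
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def make_dpo_pair(prompt: str, response: str) -> Optional[dict]:
    """Build a chosen/rejected pair by stripping the reasoning block.

    Hypothetical helper using the common prompt/chosen/rejected layout.
    Returns None when there is nothing to contrast.
    """
    if THINK_BLOCK.search(response) is None:
        return None  # no reasoning chain in this response
    rejected = THINK_BLOCK.sub("", response, count=1).strip()
    if not rejected:
        return None  # response was reasoning only; rejected would be empty
    return {"prompt": prompt, "chosen": response, "rejected": rejected}
```

Skipping responses that are empty once the reasoning is removed avoids pairs whose rejected side contains no answer at all.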

Hyperparameters

Parameter	SFT	DPO
Method	QLoRA (4-bit NF4)	QLoRA (4-bit NF4)
LoRA rank (r)	32	— (continues SFT adapter)
LoRA alpha	64	—
LoRA targets	q/k/v/o_proj, gate/up/down_proj	—
Learning rate	1e-4	5e-5
Scheduler	Cosine	Cosine
Optimizer	paged_adamw_8bit	paged_adamw_8bit
Epochs	2	2
Batch size	1	1
Gradient accumulation	16	16
Max sequence length	4,096	2,048
DPO beta	—	0.1
Precision	bf16	bf16
Gradient checkpointing	Yes	Yes
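The table's settings imply a few quantities worth spelling out. With a per-device batch of 1 and 16 accumulation steps, each optimizer step sees 1 × 16 = 16 sequences per device (the card does not state the device count, so num_devices below is an assumption). Under the standard LoRA convention the adapter update is scaled by alpha / r = 64 / 32 = 2. The DPO beta of 0.1 scales the log-ratio margin inside the sigmoid DPO loss. The pure-Python sketch below assumes that standard formulation rather than any specific trainer's implementation; its log-probability inputs are placeholders for summed token log-probs of each response under the trained policy and the frozen reference model.

```python
import math

# Values from the hyperparameter table
BETA = 0.1
BATCH_SIZE = 1
GRAD_ACCUM = 16
LORA_R, LORA_ALPHA = 32, 64


def effective_batch(batch_size: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of sequences contributing to one optimizer step."""
    return batch_size * grad_accum * num_devices


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = BETA) -> float:
    """Sigmoid DPO loss for one pair, from summed sequence log-probs.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return softplus(-margin)  # -log(sigmoid(m)) == softplus(-m)


print(effective_batch(BATCH_SIZE, GRAD_ACCUM))  # 16
print(LORA_ALPHA / LORA_R)  # 2.0
```

A loss of log 2 (about 0.693) corresponds to a zero margin, where the policy and reference rank the pair identically; the loss falls as the policy favors the response with reasoning more strongly than the reference does.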

GGUF

Quantized GGUF versions are available at empero-ai/openNemo-9B-Claude-Opus-4.6-distill-GGUF.

Requirements

torch>=2.1
transformers>=4.40
bitsandbytes>=0.43 # for 4-bit quantization

No mamba-ssm. No causal-conv1d. No CUDA kernel compilation.

Base Model

This model is built on empero-ai/openNemo-9B, a pure-PyTorch drop-in replacement for NVIDIA's Nemotron-H that removes all external CUDA kernel dependencies. See the base model card for details on the architecture changes.

Citation

@misc{openNemo-9B-Claude-Opus-distill,
 title={openNemo-9B-Claude-Opus-4.6-distill},
 author={Empero AI},
 year={2026},
 url={https://huggingface.co/empero-ai/openNemo-9B-Claude-Opus-4.6-distill}
}

License

NVIDIA Open Model License — same as the base model.

Acknowledgments

Base model: openNemo-9B by Empero AI
Original architecture: Nemotron-H by NVIDIA
Reasoning datasets: Community contributors (nohurry, Roman1111111, Crownelius, TeichAI, Jackrong, dalisoft, Hastagaras, Sao10K)


Model size: 9B params · Tensor type: BF16 · Format: Safetensors

Model tree for empero-ai/openNemo-9B-Claude-Opus-4.6-distill

nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base (base model) → nvidia/NVIDIA-Nemotron-Nano-12B-v2 → nvidia/NVIDIA-Nemotron-Nano-9B-v2 → empero-ai/openNemo-9B → this model (each step fine-tuned from the previous)
