URL: https://huggingface.co/empero-ai/openNemo-9B-Claude-Opus-4.6-distill

openNemo-9B-Claude-Opus-4.6-distill

๐Ÿ‘ openNemo

Reasoning-distilled version of openNemo-9B, fine-tuned on Claude Opus 4.6 reasoning traces.

Trained with SFT + DPO on community-curated reasoning distillation datasets to produce step-by-step <think> chains before answering. Built on the openNemo pure-PyTorch Nemotron-H architecture, so neither mamba-ssm nor causal-conv1d is required.

By Empero AI


What is this?

A 9B dense hybrid model (Mamba2 + Transformer) that has been taught to reason through problems before answering, using reasoning traces distilled from Claude Opus 4.6 and other frontier models.

The two-stage training pipeline:

  1. SFT: teaches the reasoning format (<think> tags, step-by-step chains, edge-case consideration)
  2. DPO: teaches a preference for thorough reasoning over skipping the thinking step

Quickstart

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization (fits in ~8 GB VRAM)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "empero-ai/openNemo-9B-Claude-Opus-4.6-distill",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("empero-ai/openNemo-9B-Claude-Opus-4.6-distill")

messages = [
    {"role": "system", "content": "You are a deep reasoning AI. When given a problem, you think through it carefully and methodically inside <think> tags before providing your final answer."},
    {"role": "user", "content": "Prove that the sum of the first n odd numbers equals n²."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=0.95)
response = tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
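
The decoded response from the snippet above interleaves the reasoning chain and the final answer. A minimal sketch for separating the two follows. The helper name `split_reasoning` is hypothetical, and the sketch assumes the <think> tags appear as literal text in the decoded string; if the tokenizer treats them as special tokens, skip_special_tokens=True would strip them and the split would have to happen on token IDs instead. It also tolerates the case where only the closing tag is present, which happens when a chat template places the opening tag in the prompt.

```python
def split_reasoning(response: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a decoded model response.

    Hypothetical helper: splits on the first closing </think> tag.
    """
    head, closing_tag, tail = response.partition("</think>")
    if not closing_tag:
        # No closing tag found: treat the whole text as the answer.
        return "", response.strip()
    # Drop the opening tag if it is present in the decoded text.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning("<think>1 + 3 = 4 = 2^2</think>\nThe sum is 4.")
print(answer)
```

Showing only `answer` to end users while logging `reasoning` separately is one common way to use this kind of split.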

Without thinking (instruct mode)

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)

Architecture

9B dense hybrid Nemotron-H, the same architecture as the base openNemo-9B:

Parameter          | Value
Total parameters   | ~9B
Architecture       | Hybrid Mamba2 + GQA Transformer + MLP
Layers             | 52 (Mamba2 SSM + GQA Attention + MLP)
Max context length | 262,144 tokens
Vocabulary size    | 131,072
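
As a rough sanity check on the "~8 GB VRAM" figure quoted in the Quickstart, the arithmetic below estimates the memory taken by the weights alone. It is a back-of-the-envelope sketch, not a measurement: it assumes exactly 9e9 parameters (the card only says "~9B") and ignores quantization constants, layers that stay unquantized, activations, and the attention and SSM caches, all of which add to the real footprint.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 9e9  # assumption: "~9B" taken as exactly 9e9

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)  # unquantized bf16 weights
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"bf16: {bf16_gib:.1f} GiB, 4-bit: {nf4_gib:.1f} GiB")
# prints "bf16: 16.8 GiB, 4-bit: 4.2 GiB"
```

The roughly 4 GiB of 4-bit weights leaves headroom under 8 GB for the overheads listed above, which is consistent with the Quickstart comment.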

Training Details

Stage 1: Supervised Fine-Tuning (SFT)

Trained on the 9 reasoning distillation datasets listed below:

Dataset                                                 | Type                      | Approx. Size
nohurry/Opus-4.6-Reasoning-3000x-filtered               | problem/thinking/solution | ~3,000
Roman1111111/claude-opus-4.6-10000x                     | messages                  | ~10,000
Crownelius/Opus-4.6-Reasoning-3300x                     | problem/thinking/solution | ~3,300
TeichAI/claude-haiku-4.5-high-reasoning-1700x           | messages                  | ~1,700
TeichAI/Claude-Opus-4.6-Reasoning-927x                  | messages                  | ~927
Jackrong/Qwen3.5-reasoning-700x                         | conversation              | ~700
dalisoft/claude-opus-4.6-high-reasoning-700x            | messages                  | ~700
TeichAI/claude-4.5-opus-high-reasoning-250x             | messages                  | ~250
Hastagaras/Claude-Sonnet-X-Opus-4.6-Reasoning-small-500 | messages                  | ~500

Stage 2: Direct Preference Optimization (DPO)

Preference pairs constructed from the same datasets:

  • Chosen: Full response with <think> reasoning chain
  • Rejected: Same response with <think> block stripped
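
The chosen/rejected construction described above can be sketched as follows. This is an illustration under assumptions, not the authors' preprocessing script: the helper name `make_preference_pair` is hypothetical, and the `prompt`/`chosen`/`rejected` field names follow a common preference-dataset convention that the card does not confirm was used here.

```python
import re

# Matches a <think>...</think> block plus trailing whitespace; DOTALL lets "." span newlines.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def make_preference_pair(prompt: str, response: str):
    """Build one preference record, or return None if there is no think block."""
    full = response.strip()
    stripped = THINK_BLOCK.sub("", full).strip()
    if stripped == full:
        # Nothing was removed, so chosen and rejected would be identical.
        return None
    return {"prompt": prompt, "chosen": full, "rejected": stripped}


pair = make_preference_pair(
    "What is 2 + 2?",
    "<think>2 + 2 is basic addition.\nThe result is 4.</think>\nThe answer is 4.",
)
print(pair["rejected"])  # prints "The answer is 4."
```

Skipping examples without a think block avoids pairs in which the two responses are the same text, which would carry no preference signal.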

Additional DPO source: QuietImpostor/Sao10K-Claude-3-Opus-Instruct-15K-ShareGPT

Hyperparameters

Parameter              | SFT                             | DPO
Method                 | QLoRA (4-bit NF4)               | QLoRA (4-bit NF4)
LoRA rank (r)          | 32                              | - (continues SFT adapter)
LoRA alpha             | 64                              | -
LoRA targets           | q/k/v/o_proj, gate/up/down_proj | -
Learning rate          | 1e-4                            | 5e-5
Scheduler              | Cosine                          | Cosine
Optimizer              | paged_adamw_8bit                | paged_adamw_8bit
Epochs                 | 2                               | 2
Batch size             | 1                               | 1
Gradient accumulation  | 16                              | 16
Max sequence length    | 4,096                           | 2,048
DPO beta               | -                               | 0.1
Precision              | bf16                            | bf16
Gradient checkpointing | Yes                             | Yes
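
A few derived quantities follow directly from the hyperparameters above. The sketch below works them out under stated assumptions: a single training device, standard LoRA scaling of alpha / r (not a rank-stabilized variant), and the approximate SFT dataset sizes from Stage 1 taken at face value with no filtering or deduplication. The step counts are therefore rough upper-bound estimates, not reported figures.

```python
import math

per_device_batch = 1
grad_accum_steps = 16
lora_rank, lora_alpha = 32, 64

# Sequences contributing to each optimizer step (assumes one device).
effective_batch = per_device_batch * grad_accum_steps

# Standard LoRA multiplies the low-rank update by alpha / r.
lora_scale = lora_alpha / lora_rank

# Approximate SFT dataset sizes from the Stage 1 table (assumption: used unfiltered).
sft_sizes = [3000, 10000, 3300, 1700, 927, 700, 700, 250, 500]
total_examples = sum(sft_sizes)
steps_per_epoch = math.ceil(total_examples / effective_batch)

print(effective_batch, lora_scale, total_examples, steps_per_epoch)
# prints "16 2.0 21077 1318"
```

With 2 epochs this would come to about 2,636 SFT optimizer steps if every listed example were used.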

GGUF

Quantized GGUF versions are available at empero-ai/openNemo-9B-Claude-Opus-4.6-distill-GGUF.

Requirements

torch>=2.1
transformers>=4.40
bitsandbytes>=0.43 # for 4-bit quantization

No mamba-ssm. No causal-conv1d. No CUDA kernel compilation.
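
To confirm an environment meets these minimums before loading the model, a small check along the following lines may help. It is a sketch using only the standard library: `version_tuple` and `check` are hypothetical helper names, and the version parsing is deliberately naive (it compares leading numeric components only and ignores pre-release ordering).

```python
import re
from importlib import metadata

# Minimum versions copied from the requirements list above.
MINIMUMS = {"torch": "2.1", "transformers": "4.40", "bitsandbytes": "0.43"}


def version_tuple(version: str) -> tuple:
    """Naively parse leading numeric components, e.g. '2.1.0+cu121' -> (2, 1, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def check(package: str, minimum: str) -> str:
    """Return 'ok', 'too old', or 'missing' for an installed package."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return "missing"
    return "ok" if version_tuple(installed) >= version_tuple(minimum) else "too old"


for name, minimum in MINIMUMS.items():
    print(f"{name}>={minimum}: {check(name, minimum)}")
```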

Base Model

This model is built on empero-ai/openNemo-9B, a pure-PyTorch drop-in replacement for NVIDIA's Nemotron-H that removes all external CUDA kernel dependencies. See the base model card for details on the architecture changes.

Citation

@misc{openNemo-9B-Claude-Opus-distill,
 title={openNemo-9B-Claude-Opus-4.6-distill},
 author={Empero AI},
 year={2026},
 url={https://huggingface.co/empero-ai/openNemo-9B-Claude-Opus-4.6-distill}
}

License

NVIDIA Open Model License, the same as the base model.

Acknowledgments

  • Base model: openNemo-9B by Empero AI
  • Original architecture: Nemotron-H by NVIDIA
  • Reasoning datasets: Community contributors (nohurry, Roman1111111, Crownelius, TeichAI, Jackrong, dalisoft, Hastagaras, Sao10K)