🧠 Nano-Nano v5.1

~1218.3 M · Qwen3 · 300M · GQA + QK-Norm · Sequence-Packed · 26 Datasets


Fully redesigned successor to Nano-Nano v4.5.
~298M Qwen3 parameters trained with sequence packing on a quality-tiered 26-dataset mix. Features a loss-boost system that automatically extends training if the loss is still above 4.95 (up to 3 rounds of 75 steps).
Goal: loss < 2.5 through compute efficiency, not raw scale.

📋 Summary


Architecture	Qwen3 decoder-only
Parameters	~1218.3 M
Context	2 048 tokens
Vocabulary	50 304 tokens
Training loss	`2.0444`
Eval score	`16.7%`
Tokens trained	0.01 B (sequence-packed)
Hardware	GTX 1080 8 GB (Pascal)

🏗️ Architecture (v4 → v4.5 → v5.1)

Hyperparameter	v4	v4.5	v5.1
Parameters	~236 M	~256 M	~1218.3 M (~1.218 B)
`hidden_size`	896	896	1 024
`intermediate_size`	2 688	2 912	2 730 (8/3×hidden)
`num_hidden_layers`	14	15	16
`num_attention_heads`	14	14	16
`num_key_value_heads`	14 (full MHA)	14 (full MHA)	8 (GQA: 16 Q / 8 KV)
`head_dim`	64	64	64
`vocab_size`	50 264	50 264	50 304
`max_position_embeddings`	1 024	2 048	2 048
`rms_norm_eps`	1e-6	1e-6	1e-5
`rope_theta`	10 000	10 000	1 000 000
`rope_scaling`	—	linear 2×	linear 2×
`tie_word_embeddings`	False	False	False
Sequence packing	❌	❌	✅ 1× packed
Architecture	LLaMA	LLaMA	Qwen3
QK-Norm	❌	❌	✅
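
As a rough cross-check of the v5.1 column, the parameter count can be estimated from the listed hyperparameters. This is only a sketch: it assumes bias-free linear layers, a gated three-matrix MLP, two RMSNorm weights per layer plus a final norm, and QK-Norm weights of size `head_dim`, with 16 query heads and 8 KV heads as in the GQA row. The function name `count_params` is illustrative and not part of the released code. Under these assumptions the estimate lands near 288 M, which is closer to the ~298M quoted in the introduction than to the ~1218.3 M shown in the tables.

```python
def count_params(hidden=1024, inter=2730, layers=16, q_heads=16, kv_heads=8,
                 head_dim=64, vocab=50304, tied=False, qk_norm=True):
    """Estimate parameters of a bias-free decoder with a gated MLP (assumed layout)."""
    embed = vocab * hidden
    lm_head = 0 if tied else vocab * hidden        # untied output projection
    attn = hidden * q_heads * head_dim             # q_proj
    attn += 2 * hidden * kv_heads * head_dim       # k_proj + v_proj
    attn += q_heads * head_dim * hidden            # o_proj
    mlp = 3 * hidden * inter                       # gate, up and down projections
    norms = 2 * hidden                             # two RMSNorm weights per layer
    if qk_norm:
        norms += 2 * head_dim                      # q_norm and k_norm weights
    final_norm = hidden
    return embed + lm_head + layers * (attn + mlp + norms) + final_norm

total = count_params()
print(f"{total:,} parameters (~{total / 1e6:.1f} M)")
```

Passing `kv_heads=16` models full multi-head attention instead and adds about 17 M parameters, so the GQA choice does not change the order of magnitude.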

📊 Evaluation

Category	Hits	Score
Knowledge	0/5	🔴 0%
Reasoning	0/4	🔴 0%
Hallucination	0/4	🔴 0%
Instruction	2/4	🟡 50%
Coherence	1/3	🔴 33%
Overall (mean of category scores)	3/20	🔴 17%

Hallucination resistance tests whether the model correctly declines or hedges on unanswerable questions (future events, fictional entities, impossible premises).
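
The overall figure matches the unweighted mean of the five category scores rather than the pooled hit rate. A small sketch of that arithmetic, using only the hits from the table above (the variable names are illustrative):

```python
from fractions import Fraction

# (hits, questions) per category, copied from the evaluation table.
results = {
    "Knowledge": (0, 5),
    "Reasoning": (0, 4),
    "Hallucination": (0, 4),
    "Instruction": (2, 4),
    "Coherence": (1, 3),
}

# Macro average: mean of the per-category scores.
per_category = [Fraction(hits, n) for hits, n in results.values()]
macro = sum(per_category) / len(per_category)

# Micro average: all hits pooled over all questions.
micro = Fraction(sum(h for h, _ in results.values()),
                 sum(n for _, n in results.values()))

print(f"macro: {float(macro):.1%}  micro: {float(micro):.1%}")
```

The macro average comes out at 16.7%, the reported eval score, while the pooled rate is 3/20 = 15.0%.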

Figures: Category Scores · Hallucination Resistance · Training Curves

🍳 Training

What's new in v5.1

Change	v4.5	v5.1	Why
Sequence packing	❌ padding waste	✅ 100% tokens	~3× more signal per step
Dataset quality	mixed web+instruction	GPT-4 quality-tiered	Faster loss reduction
Parameters	~256 M	~1218.3 M (~1.218 B)	Better capacity
Datasets	15	26	More diversity
LR	1e-4	2e-4	1e-4 was too conservative
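
To illustrate the sequence-packing row, here is a minimal sketch of one common way to pack tokenized samples into fixed-length chunks with no padding. It is not the training script: the function name `pack_sequences`, the `eos_id` separator value and the choice to drop the final partial chunk are assumptions, and the chunk length of 512 is inferred from the effective batch of 8 × 512 in the settings.

```python
from typing import Iterable, Iterator, List

def pack_sequences(token_streams: Iterable[List[int]], chunk_len: int = 512,
                   eos_id: int = 0, max_chunks: int = 150_000) -> Iterator[List[int]]:
    """Concatenate samples (separated by eos_id) and emit full chunks only."""
    buffer: List[int] = []
    emitted = 0
    for tokens in token_streams:           # streaming: one sample at a time
        buffer.extend(tokens)
        buffer.append(eos_id)              # mark the sample boundary
        while len(buffer) >= chunk_len:
            chunk, buffer = buffer[:chunk_len], buffer[chunk_len:]
            emitted += 1
            yield chunk
            if emitted >= max_chunks:      # hard cap on the number of chunks
                return
    # Any leftover shorter than chunk_len is dropped in this sketch.

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = list(pack_sequences(docs, chunk_len=4))
print(packed)
```

Every emitted chunk is completely filled with real tokens, which is the property the table describes as "100% tokens".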

Settings

Setting	Value
Hardware	GTX 1080 8 GB · Pascal · compute capability 6.1
Precision	fp32 weights / fp16 AMP (GradScaler)
Optimizer	StovetopCooker (HyperNix, pre-Volta) + cosine
LR	`0.0002` cosine
Warmup	8%
Embedding freeze	First 20% of steps
Effective batch	8 × 512 = 4,096 tokens/step
Loss boost	≤3 rounds of 75 steps if loss > 4.95
Sequence packing	✅ streaming, 1 epoch, capped at 150,000 chunks
Grad clipping	5.0
Grad checkpointing	✅
Peak VRAM	5.44 GB
Final loss	`2.0444`
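
The schedule-related settings above (peak LR, 8% warmup, cosine decay, embedding freeze for the first 20% of steps, loss boost) can be sketched as plain functions. This is an illustration under assumptions, not the actual trainer: linear warmup, decay to zero, and the rule that the loss is re-checked after each boost round are all guesses, and every function name here is hypothetical.

```python
import math

PEAK_LR = 2e-4          # LR row
WARMUP_FRAC = 0.08      # Warmup row
FREEZE_FRAC = 0.20      # Embedding freeze row
BOOST_THRESHOLD = 4.95  # Loss boost row
BOOST_STEPS = 75
MAX_BOOST_ROUNDS = 3

def lr_at(step: int, total_steps: int, peak_lr: float = PEAK_LR) -> float:
    """Linear warmup followed by cosine decay to zero (assumed shape)."""
    warmup = max(1, int(total_steps * WARMUP_FRAC))
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total_steps - warmup))
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

def embeddings_frozen(step: int, total_steps: int) -> bool:
    """True while the embedding weights would be excluded from updates."""
    return step < int(total_steps * FREEZE_FRAC)

def boost_steps(loss_checks) -> int:
    """Extra steps granted, given the loss seen before each potential boost round."""
    extra = 0
    for round_idx, loss in enumerate(loss_checks):
        if round_idx >= MAX_BOOST_ROUNDS or loss <= BOOST_THRESHOLD:
            break
        extra += BOOST_STEPS
    return extra

print(lr_at(0, 1000), lr_at(80, 1000), embeddings_frozen(150, 1000))
print(boost_steps([5.3, 5.0, 4.9]))
```

With a final loss of `2.0444`, well under the 4.95 threshold, this logic would grant no extra steps, so the boost acts only as a safety net for runs that stall early.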

Dataset Mix (26 datasets, quality-tiered)

Tier	Dataset	Samples	Weight	Category
1	`Open-Orca/OpenOrca`	40 k	3.0×	GPT-4 reasoning
1	`meta-math/MetaMathQA`	30 k	2.8×	Math augmentation
1	`Roman1111111/claude-opus-4.6-10000x`	10 k	2.5×	Claude conversations
1	`WizardLM/WizardLM_evol_instruct_V2_196k`	25 k	2.5×	Evolved instruction
1	`WithinUsAI/GPT5.5_thinking_max_distill_god_seed_25K`	25 k	2.5×	Reasoning traces
2	`microsoft/orca-math-word-problems-200k`	20 k	2.2×	Math word problems
2	`lighteval/MATH-Hard`	10 k	2.2×	Hard math
2	`HuggingFaceH4/MATH-500`	500	2.2×	Competition math
2	`garage-bAInd/Open-Platypus`	25 k	2.0×	Reasoning instruction
2	`teknium/OpenHermes-2.5`	30 k	2.0×	GPT-4 instruction
3	`ise-uiuc/Magicoder-OSS-Instruct-75K`	20 k	1.8×	Code instruction
3	`m-a-p/CodeFeedback-Filtered-Instruction`	15 k	1.8×	Code + feedback
3	`iamtarun/python_code_instructions_18k_alpaca`	8 k	1.6×	Python code
3	`nvidia/OpenCodeInstruct`	20 k	1.5×	Code instruction
3	`b-mc2/sql-create-context`	6 k	1.4×	SQL generation
4	`HuggingFaceH4/ultrachat_200k`	30 k	1.5×	Multi-turn chat
4	`databricks/databricks-dolly-15k`	15 k	1.2×	Instruction following
4	`Amod/mental_health_counseling_conversations`	5 k	1.0×	Counseling chat
4	`mlabonne/guanaco-llama2-1k`	1 k	1.0×	General QA
5	`ray0rf1re/FineWeb-Nano`	20 k	0.8×	Web text
5	`ray0rf1re/hyper-pip`	85	3.0×	HyperNix pip data
3	`flytech/python-codes-25k`	20 k	1.7×	Python code solutions
3	`ByteDance-Seed/Code-Contests-Plus`	15 k	1.6×	Competitive coding
1	`open-thoughts/OpenThoughts-TB-dev`	20 k	2.3×	Verified thinking traces
6	`Nix-ai/cat-math-v1`	5 k	0.3×	Cat math (niche)
6	`Nix-ai/Cat-v2.8XXXL-plus`	5 k	0.3×	Cat general (niche)
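
The card does not say how the Weight column is applied. One plausible reading, shown here purely as an assumption, is that each dataset's sampling mass is its sample count multiplied by its weight. The sketch uses four rows from the table; `sampling_probs` is a hypothetical helper name.

```python
# (samples, weight) for a few rows of the mix table.
mix = {
    "Open-Orca/OpenOrca": (40_000, 3.0),
    "meta-math/MetaMathQA": (30_000, 2.8),
    "ray0rf1re/FineWeb-Nano": (20_000, 0.8),
    "Nix-ai/cat-math-v1": (5_000, 0.3),
}

def sampling_probs(mix):
    """Probability of drawing from each dataset if mass = samples * weight (assumed)."""
    mass = {name: n * w for name, (n, w) in mix.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}

probs = sampling_probs(mix)
for name, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {p:.3f}")
```

Under this reading, a tier-1 set such as OpenOrca dominates the stream, while the niche tier-6 sets contribute well under 1% of draws.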

🚀 Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ray0rf1re/Nano-Nano_v5.1", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("ray0rf1re/Nano-Nano_v5.1")

def chat(prompt: str, max_new_tokens: int = 256) -> str:
    # ChatML prompt. The trailing <think> opens the reasoning block: the model
    # writes its reasoning, then </think>, then the answer.
    text = ("<|im_start|>user\n" + prompt + "<|im_end|>\n"
            "<|im_start|>assistant\n<think>\n")
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs, max_new_tokens=max_new_tokens,
        do_sample=True, temperature=0.7, top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:],
                            skip_special_tokens=True).strip()

print(chat("Write a Python function to merge two sorted lists."))
print(chat("Solve: if 3x + 7 = 22, what is x?"))
print(chat("Explain transformer attention in simple terms."))

⚠️ Limitations

Context limited to 2 048 tokens
Trained on 0.01 B tokens — far below production scale
Pascal GPU (GTX 1080): fp16 AMP only, no bf16
Not RLHF/DPO aligned

📜 Citation

@misc{nano-nano-v5,
 author = {ray0rf1re},
 title = {Nano-Nano v5.1: 300M LLaMA with Sequence Packing},
 year = {2026},
 publisher = {HuggingFace},
 howpublished = {https://huggingface.co/ray0rf1re/Nano-Nano_v5.1},
}


