
URL: https://huggingface.co/ray0rf1re/Nano-Nano_v5.1



🧠 Nano-Nano v5.1

~1218.3 M · Qwen3 · 300M · GQA + QK-Norm · Sequence-Packed · 26 Datasets


Fully redesigned successor to Nano-Nano v4.5.
~298M Qwen3 parameters, trained with sequence packing on a quality-tiered 26-dataset mix. A loss-boost system automatically extends training if the loss is above 4.95 (up to 3 rounds of 75 steps).
Goal: loss < 2.5 through compute efficiency, not raw scale.
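
The loss-boost rule can be read as a small control loop. The sketch below is an assumption about its shape, not the author's training code: `train_steps` is a hypothetical callback that runs a given number of optimizer steps and returns the latest training loss. Only the threshold (4.95), the round length (75 steps), and the cap (3 rounds) come from this card.

```python
from typing import Callable

LOSS_THRESHOLD = 4.95   # from the card
BOOST_STEPS = 75        # steps per extra round (from the card)
MAX_BOOST_ROUNDS = 3    # cap on extra rounds (from the card)


def train_with_loss_boost(train_steps: Callable[[int], float], base_steps: int) -> tuple[float, int]:
    """Run base_steps, then up to MAX_BOOST_ROUNDS extra rounds while loss stays high.

    train_steps is a hypothetical callback: it runs n optimizer steps and
    returns the most recent training loss. Returns (final_loss, total_steps).
    """
    loss = train_steps(base_steps)
    total = base_steps
    rounds = 0
    while loss > LOSS_THRESHOLD and rounds < MAX_BOOST_ROUNDS:
        loss = train_steps(BOOST_STEPS)
        total += BOOST_STEPS
        rounds += 1
    return loss, total
```

In the worst case this adds 3 × 75 = 225 steps to the planned schedule.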


📋 Summary

Architecture: Qwen3 decoder-only
Parameters: ~1218.3 M
Context: 2,048 tokens
Vocabulary: 50,304 tokens
Training loss: 2.0444
Eval score: 16.7%
Tokens trained: 0.01 B (sequence-packed)
Hardware: GTX 1080 8 GB (Pascal)

🏗️ Architecture (v4 → v4.5 → v5.1)

| Hyperparameter | v4 | v4.5 | v5.1 |
|---|---|---|---|
| Parameters | ~236 M | ~256 M | ~1218.3 M (~1.218 B) |
| hidden_size | 896 | 896 | 1,024 |
| intermediate_size | 2,688 | 2,912 | 2,730 (8/3 × hidden) |
| num_hidden_layers | 14 | 15 | 16 |
| num_attention_heads | 14 | 14 | 16 |
| num_key_value_heads | 14 | 14 | 16 |
| head_dim | 64 | 64 | 64 |
| vocab_size | 50,264 | 50,264 | 50,304 |
| max_position_embeddings | 1,024 | 2,048 | 2,048 |
| rms_norm_eps | 1e-6 | 1e-6 | 1e-5 |
| rope_theta | 10,000 | 10,000 | 10,000 |
| rope_scaling | - | linear 2× | linear 2× |
| tie_word_embeddings | False | False | False |
| Sequence packing | ❌ | ❌ | ✅ 1× packed |
| Architecture | LLaMA | LLaMA | Qwen3 |
| GQA (KV heads) | 14 full | 16 full | 8 (16Q/8KV) |
| QK-Norm | ❌ | ❌ | ✅ |
| rope_theta | 10k | 10k | 1M |
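
To make the GQA row (16 query heads, 8 KV heads) concrete, the sketch below maps query heads onto key/value heads and estimates the KV-cache size at full context. It assumes the common grouping in which consecutive query heads share one KV head, and an fp16 cache. Neither is stated in this card, and the function names are illustrative.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int = 16, num_kv_heads: int = 8) -> int:
    """Index of the KV head shared by a query head, assuming consecutive grouping."""
    group_size = num_q_heads // num_kv_heads  # 2 query heads per KV head
    return q_head // group_size


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence (fp16 assumed)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


gqa = kv_cache_bytes(num_layers=16, num_kv_heads=8, head_dim=64, seq_len=2048)
mha = kv_cache_bytes(num_layers=16, num_kv_heads=16, head_dim=64, seq_len=2048)
print(gqa // 2**20, "MiB with 8 KV heads vs", mha // 2**20, "MiB with 16")
# 64 MiB with 8 KV heads vs 128 MiB with 16
```

Under these assumptions, halving the KV heads halves the cache for a 2,048-token sequence, which matters on an 8 GB card.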

📊 Evaluation

| Category | Hits | Score |
|---|---|---|
| Knowledge | 0/5 | 🔴 0% |
| Reasoning | 0/4 | 🔴 0% |
| Hallucination | 0/4 | 🔴 0% |
| Instruction | 2/4 | 🟡 50% |
| Coherence | 1/3 | 🔴 33% |
| Overall | - | 🔴 17% |

Hallucination resistance tests whether the model correctly declines or hedges on unanswerable questions (future events, fictional entities, impossible premises).
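
The overall figure matches an unweighted mean of the five category percentages rather than pooled hits (3/20 = 15%). A quick check, assuming that is how the card computes it:

```python
# Hits per category, copied from the evaluation table: (correct, total).
hits = {
    "Knowledge": (0, 5),
    "Reasoning": (0, 4),
    "Hallucination": (0, 4),
    "Instruction": (2, 4),
    "Coherence": (1, 3),
}

# Macro average: mean of per-category accuracies (assumed scoring rule).
macro = sum(c / t for c, t in hits.values()) / len(hits)

# Micro average: pooled hits over pooled questions, shown for comparison.
micro = sum(c for c, _ in hits.values()) / sum(t for _, t in hits.values())

print(f"macro = {macro:.1%}, micro = {micro:.1%}")  # macro = 16.7%, micro = 15.0%
```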

[Figures: Category Scores, Hallucination Resistance, Training Curves]


🍳 Training

What's new in v5.1

| Change | v4.5 | v5.1 | Why |
|---|---|---|---|
| Sequence packing | ❌ padding waste | ✅ 100% tokens | ~3× more signal per step |
| Dataset quality | mixed web+instruction | GPT-4 quality-tiered | Faster loss reduction |
| Parameters | ~256 M | ~1218.3 M (~1.218 B) | Better capacity |
| Datasets | 15 | 26 | More diversity |
| LR | 1e-4 | 2e-4 | 1e-4 was too conservative |
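
Sequence packing concatenates tokenized samples and slices the stream into fixed-length chunks, so no positions are spent on padding. A minimal sketch, assuming an end-of-sequence token separates samples and a trailing partial chunk is dropped; `pack_sequences` and `eos_id` are illustrative names, not taken from the training script.

```python
from typing import Iterable, Iterator


def pack_sequences(samples: Iterable[list[int]], chunk_len: int, eos_id: int) -> Iterator[list[int]]:
    """Yield full chunks of exactly chunk_len token ids from a stream of samples."""
    buffer: list[int] = []
    for ids in samples:
        buffer.extend(ids)
        buffer.append(eos_id)  # mark the sample boundary (assumed convention)
        while len(buffer) >= chunk_len:
            yield buffer[:chunk_len]
            buffer = buffer[chunk_len:]
    # Any leftover shorter than chunk_len is dropped in this sketch.


chunks = list(pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], chunk_len=4, eos_id=0))
print(chunks)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

With `chunk_len=512` and 8 chunks per step, every one of the 4,096 positions in a step carries a real token.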

Settings

| Setting | Value |
|---|---|
| Hardware | GTX 1080 8 GB · Pascal · CUDA compute capability 6.1 |
| Precision | fp32 weights / fp16 AMP (GradScaler) |
| Optimizer | StovetopCooker (HyperNix, pre-Volta) + cosine |
| LR | 0.0002 cosine |
| Warmup | 8% |
| Embedding freeze | First 20% of steps |
| Effective batch | 8 × 512 = 4,096 tokens/step |
| Loss boost | ≤3 rounds of 75 steps if loss > 4.95 |
| Sequence packing | ✅ streaming, 1 epoch, capped at 150,000 chunks |
| Grad clipping | 5.0 |
| Grad checkpointing | ✅ |
| Peak VRAM | 5.44 GB |
| Final loss | 2.0444 |
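
The learning-rate and freeze settings can be sketched as plain functions of the step index. This assumes linear warmup over the first 8% of steps followed by cosine decay to zero. The card gives only the peak LR, "cosine", and the warmup fraction, so the warmup shape, the zero floor, and the function names are assumptions.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 2e-4, warmup_frac: float = 0.08) -> float:
    """Learning rate at a step: linear warmup, then cosine decay to 0 (assumed)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def embeddings_frozen(step: int, total_steps: int, freeze_frac: float = 0.20) -> bool:
    """True while the embedding matrix should be excluded from updates."""
    return step < int(total_steps * freeze_frac)
```

For a 1,000-step run this reaches the peak of 2e-4 at step 79 and keeps the embeddings frozen through step 199.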

Dataset Mix (26 datasets, quality-tiered)

| Tier | Dataset | Samples | Weight | Category |
|---|---|---|---|---|
| 1 | Open-Orca/OpenOrca | 40 k | 3.0× | GPT-4 reasoning |
| 1 | meta-math/MetaMathQA | 30 k | 2.8× | Math augmentation |
| 1 | Roman1111111/claude-opus-4.6-10000x | 10 k | 2.5× | Claude conversations |
| 1 | WizardLM/WizardLM_evol_instruct_V2_196k | 25 k | 2.5× | Evolved instruction |
| 1 | WithinUsAI/GPT5.5_thinking_max_distill_god_seed_25K | 25 k | 2.5× | Reasoning traces |
| 2 | microsoft/orca-math-word-problems-200k | 20 k | 2.2× | Math word problems |
| 2 | lighteval/MATH-Hard | 10 k | 2.2× | Hard math |
| 2 | HuggingFaceH4/MATH-500 | 500 | 2.2× | Competition math |
| 2 | garage-bAInd/Open-Platypus | 25 k | 2.0× | Reasoning instruction |
| 2 | teknium/OpenHermes-2.5 | 30 k | 2.0× | GPT-4 instruction |
| 3 | ise-uiuc/Magicoder-OSS-Instruct-75K | 20 k | 1.8× | Code instruction |
| 3 | m-a-p/CodeFeedback-Filtered-Instruction | 15 k | 1.8× | Code + feedback |
| 3 | iamtarun/python_code_instructions_18k_alpaca | 8 k | 1.6× | Python code |
| 3 | nvidia/OpenCodeInstruct | 20 k | 1.5× | Code instruction |
| 3 | b-mc2/sql-create-context | 6 k | 1.4× | SQL generation |
| 4 | HuggingFaceH4/ultrachat_200k | 30 k | 1.5× | Multi-turn chat |
| 4 | databricks/databricks-dolly-15k | 15 k | 1.2× | Instruction following |
| 4 | Amod/mental_health_counseling_conversations | 5 k | 1.0× | Counseling chat |
| 4 | mlabonne/guanaco-llama2-1k | 1 k | 1.0× | General QA |
| 5 | ray0rf1re/FineWeb-Nano | 20 k | 0.8× | Web text |
| 5 | ray0rf1re/hyper-pip | 85 | 3.0× | HyperNix pip data |
| 3 | flytech/python-codes-25k | 20 k | 1.7× | Python code solutions |
| 3 | ByteDance-Seed/Code-Contests-Plus | 15 k | 1.6× | Competitive coding |
| 1 | open-thoughts/OpenThoughts-TB-dev | 20 k | 2.3× | Verified thinking traces |
| 6 | Nix-ai/cat-math-v1 | 5 k | 0.3× | Cat math (niche) |
| 6 | Nix-ai/Cat-v2.8XXXL-plus | 5 k | 0.3× | Cat general (niche) |
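
One way to read the Samples and Weight columns is that each dataset's share of the mix is proportional to samples × weight. The card does not say how the weights are applied, so this is an assumption. The sketch uses three rows from the table for illustration, and `mixture_shares` is a hypothetical helper.

```python
def mixture_shares(rows: dict[str, tuple[int, float]]) -> dict[str, float]:
    """Map dataset name -> share of the mix, assuming share is proportional to samples * weight."""
    mass = {name: n * w for name, (n, w) in rows.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}


# (samples, weight) for three rows of the table; the other rows are omitted here.
rows = {
    "Open-Orca/OpenOrca": (40_000, 3.0),
    "meta-math/MetaMathQA": (30_000, 2.8),
    "ray0rf1re/FineWeb-Nano": (20_000, 0.8),
}
for name, share in mixture_shares(rows).items():
    print(f"{name}: {share:.1%}")
```

Under this reading, a high weight on a tiny dataset such as ray0rf1re/hyper-pip (85 samples at 3.0×) still contributes very little to the overall mix.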

🚀 Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ray0rf1re/Nano-Nano_v5.1", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("ray0rf1re/Nano-Nano_v5.1")

def chat(prompt: str, max_new_tokens: int = 256) -> str:
    # <think> opens the reasoning block; the model emits its reasoning, then </think>, then the answer
    text = ("<|im_start|>user\n" + prompt + "<|im_end|>\n"
            "<|im_start|>assistant\n<think>\n")
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs, max_new_tokens=max_new_tokens,
        do_sample=True, temperature=0.7, top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, not the prompt
    return tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:],
                            skip_special_tokens=True).strip()

print(chat("Write a Python function to merge two sorted lists."))
print(chat("Solve: if 3x + 7 = 22, what is x?"))
print(chat("Explain transformer attention in simple terms."))
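
Because the prompt ends with an open `<think>` tag, the decoded text should contain the reasoning, a closing `</think>`, and then the answer. The helper below separates the two. It assumes `</think>` survives decoding (it may not if the tokenizer treats it as a special token) and falls back to treating the whole output as the answer. `split_reasoning` is an illustrative name, not part of the model's API.

```python
def split_reasoning(generated: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Return (reasoning, answer) from text generated after an opening <think>."""
    reasoning, sep, answer = generated.partition(close_tag)
    if not sep:
        # No closing tag found: treat everything as the answer.
        return "", generated.strip()
    return reasoning.strip(), answer.strip()


reasoning, answer = split_reasoning("3x = 15, so x = 5.\n</think>\nx = 5")
print(answer)  # x = 5
```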

⚠️ Limitations

  • Context limited to 2,048 tokens
  • Trained on 0.01 B tokens, far below production scale
  • Pascal GPU (GTX 1080): fp16 AMP only, no bf16
  • Not RLHF/DPO aligned

📜 Citation

@misc{nano-nano-v5,
 author = {ray0rf1re},
 title = {Nano-Nano v5.1: 300M LLaMA with Sequence Packing},
 year = {2026},
 publisher = {HuggingFace},
 howpublished = {https://huggingface.co/ray0rf1re/Nano-Nano_v5.1},
}