Nano-Nano v5.1
~1218.3 M · Qwen3 · 300M · GQA + QK-Norm · Sequence-Packed · 26 Datasets
Fully redesigned successor to Nano-Nano v4.5.
~298M Qwen3 parameters trained with sequence packing on a quality-tiered 26-dataset mix.
Features a loss-boost system that automatically extends training if loss > 4.95 (up to 3 rounds of 75 steps).
Goal: loss < 2.5 through compute efficiency, not raw scale.
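The loss-boost rule can be sketched as a small wrapper around the training loop. This is a minimal illustration, not the project's actual trainer: `run_steps` is a hypothetical callable that trains for a given number of steps and returns the latest loss. Only the 4.95 threshold, the 75-step rounds, and the cap of 3 rounds come from this card.

```python
from typing import Callable, Tuple


def train_with_loss_boost(
    run_steps: Callable[[int], float],  # hypothetical: trains n steps, returns current loss
    base_steps: int,
    threshold: float = 4.95,   # from the card
    boost_steps: int = 75,     # from the card
    max_rounds: int = 3,       # from the card
) -> Tuple[float, int, int]:
    """Run the base schedule, then add boost rounds while loss stays above threshold."""
    loss = run_steps(base_steps)
    total_steps = base_steps
    rounds = 0
    while loss > threshold and rounds < max_rounds:
        loss = run_steps(boost_steps)
        total_steps += boost_steps
        rounds += 1
    return loss, total_steps, rounds
```

With the reported final loss of 2.0444 the condition `loss > 4.95` would not trigger, so under this reading the boost acts purely as a safety net for runs that stall early.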
Summary
| Property | Value |
|---|---|
| Architecture | LLaMA decoder-only |
| Parameters | ~1218.3 M |
| Context | 2 048 tokens |
| Vocabulary | 50,304 tokens |
| Training loss | 2.0444 |
| Eval score | 16.7% |
| Tokens trained | 0.01 B (sequence-packed) |
| Hardware | GTX 1080 8 GB (Pascal) |
Architecture (v4 → v4.5 → v5.1)
| Hyperparameter | v4 | v4.5 | v5.1 |
|---|---|---|---|
| Parameters | ~236 M | ~256 M | ~1218.3 M (~1.218 B) |
| `hidden_size` | 896 | 896 | 1 024 |
| `intermediate_size` | 2 688 | 2 912 | 2 730 (8/3 × hidden) |
| `num_hidden_layers` | 14 | 15 | 16 |
| `num_attention_heads` | 14 | 14 | 16 |
| `num_key_value_heads` | 14 | 14 | 16 |
| `head_dim` | 64 | 64 | 64 |
| `vocab_size` | 50 264 | 50 264 | 50 304 |
| `max_position_embeddings` | 1 024 | 2 048 | 2 048 |
| `rms_norm_eps` | 1e-6 | 1e-6 | 1e-5 |
| `rope_theta` | 10 000 | 10 000 | 10 000 |
| `rope_scaling` | – | linear 2× | linear 2× |
| `tie_word_embeddings` | False | False | False |
| Sequence packing | ❌ | ❌ | ✅ 1× packed |
| Architecture | LLaMA | LLaMA | Qwen3 |
| GQA (KV heads) | 14 full | 16 full | 8 (16Q/8KV) |
| QK-Norm | ❌ | ❌ | ✅ |
| `rope_theta` | 10k | 10k | 1M |
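As a sanity check on the v5.1 column, the parameter count can be estimated from the listed hyperparameters. The sketch below assumes a standard Qwen3-style layout (bias-free attention projections, a gated MLP with three projection matrices, two RMSNorms per layer, per-head QK-Norm weights, untied input and output embeddings). The function name and layout are illustrative assumptions, not taken from the training code.

```python
def estimate_params(
    hidden: int = 1024,
    layers: int = 16,
    intermediate: int = 2730,
    vocab: int = 50304,
    q_heads: int = 16,
    kv_heads: int = 8,       # the table lists both 16 and 8 for v5.1
    head_dim: int = 64,
    tied_embeddings: bool = False,
) -> int:
    """Back-of-the-envelope parameter count for a Qwen3-style decoder (assumed layout)."""
    embeddings = vocab * hidden * (1 if tied_embeddings else 2)  # input + lm_head
    attn = (
        hidden * q_heads * head_dim         # q_proj
        + 2 * hidden * kv_heads * head_dim  # k_proj + v_proj
        + q_heads * head_dim * hidden       # o_proj
    )
    mlp = 3 * hidden * intermediate         # gate, up, down projections
    norms = 2 * hidden + 2 * head_dim       # two RMSNorms + q/k norm weights
    return embeddings + layers * (attn + mlp + norms) + hidden  # + final norm


if __name__ == "__main__":
    for kv in (8, 16):
        print(f"kv_heads={kv}: {estimate_params(kv_heads=kv) / 1e6:.1f} M")
```

Under these assumptions the listed sizes give roughly 288 M parameters with 8 KV heads and roughly 304 M with 16, which is close to the "300M" in the header rather than the ~1218.3 M figure. The card does not explain the difference.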
Evaluation
| Category | Hits | Score |
|---|---|---|
| Knowledge | 0/5 | 🔴 0% |
| Reasoning | 0/4 | 🔴 0% |
| Hallucination | 0/4 | 🔴 0% |
| Instruction | 2/4 | 🟡 50% |
| Coherence | 1/3 | 🔴 33% |
| Overall | – | 🔴 17% |
Hallucination resistance tests whether the model correctly declines or hedges on unanswerable questions (future events, fictional entities, impossible premises).
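The overall figure appears to be the unweighted mean of the five category scores rather than pooled hits over all 20 prompts. The short check below reproduces both numbers from the hit counts in the table; the helper names are illustrative.

```python
from typing import Dict, Tuple

# (hits, total) per category, copied from the evaluation table
RESULTS: Dict[str, Tuple[int, int]] = {
    "Knowledge": (0, 5),
    "Reasoning": (0, 4),
    "Hallucination": (0, 4),
    "Instruction": (2, 4),
    "Coherence": (1, 3),
}


def macro_average(results: Dict[str, Tuple[int, int]]) -> float:
    """Mean of per-category accuracies (each category counts equally)."""
    return sum(hits / total for hits, total in results.values()) / len(results)


def micro_average(results: Dict[str, Tuple[int, int]]) -> float:
    """Pooled accuracy over all prompts (each prompt counts equally)."""
    hits = sum(h for h, _ in results.values())
    total = sum(t for _, t in results.values())
    return hits / total


if __name__ == "__main__":
    print(f"macro: {macro_average(RESULTS):.1%}")  # 16.7%
    print(f"micro: {micro_average(RESULTS):.1%}")  # 15.0%
```

The macro average matches the reported 16.7%, while pooling all prompts would give 15.0% (3 of 20).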
Figures: Category Scores · Hallucination Resistance · Training Curves
Training
What's new in v5.1
| Change | v4.5 | v5.1 | Why |
|---|---|---|---|
| Sequence packing | ❌ padding waste | ✅ 100% tokens | ~3× more signal per step |
| Dataset quality | mixed web+instruction | GPT-4 quality-tiered | Faster loss reduction |
| Parameters | ~256 M | ~1218.3 M (~1.218 B) | Better capacity |
| Datasets | 15 | 26 | More diversity |
| LR | 1e-4 | 2e-4 | 1e-4 was too conservative |
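Sequence packing concatenates tokenized samples into one stream and cuts it into fixed-length chunks, so no positions are spent on padding. The generator below is a minimal sketch of that idea, not the project's data loader: the function name, the EOS separator between samples, and dropping the final partial chunk are assumptions. The optional `max_chunks` mirrors the chunk cap listed under Settings.

```python
from typing import Iterable, Iterator, List, Optional


def pack_sequences(
    docs: Iterable[List[int]],
    chunk_len: int,
    eos_id: int,
    max_chunks: Optional[int] = None,
) -> Iterator[List[int]]:
    """Stream token-id lists into fixed-length chunks with no padding."""
    buffer: List[int] = []
    emitted = 0
    for ids in docs:
        buffer.extend(ids)
        buffer.append(eos_id)  # assumed separator between samples
        while len(buffer) >= chunk_len:
            if max_chunks is not None and emitted >= max_chunks:
                return
            yield buffer[:chunk_len]
            buffer = buffer[chunk_len:]
            emitted += 1
    # A trailing partial chunk is dropped in this sketch.


if __name__ == "__main__":
    toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    for chunk in pack_sequences(toy_docs, chunk_len=4, eos_id=0):
        print(chunk)
```

Note that packed chunks can span sample boundaries. Whether attention is masked across those boundaries is not stated in this card.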
Settings
| Setting | Value |
|---|---|
| Hardware | GTX 1080 8 GB · Pascal · compute capability 6.1 |
| Precision | fp32 weights / fp16 AMP (GradScaler) |
| Optimizer | StovetopCooker (HyperNix, pre-Volta) + cosine |
| LR | 0.0002 cosine |
| Warmup | 8% |
| Embedding freeze | First 20% of steps |
| Effective batch | 8 × 512 = 4,096 tokens/step |
| Loss boost | ≤3 rounds of 75 steps if loss > 4.95 |
| Sequence packing | ✅ streaming, 1× epochs, 150,000 chunks cap |
| Grad clipping | 5.0 |
| Grad checkpointing | ✅ |
| Peak VRAM | 5.44 GB |
| Final loss | 2.0444 |
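The learning-rate and embedding-freeze settings can be written as two small step-based functions. This is a sketch under stated assumptions: linear warmup over the first 8% of steps, cosine decay to zero afterwards, and a freeze decided purely by step index. The card gives the peak LR, the warmup fraction, and the 20% freeze window, but not the exact warmup shape or the final LR.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 2e-4,
          warmup_frac: float = 0.08, min_lr: float = 0.0) -> float:
    """Linear warmup followed by cosine decay (assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def embeddings_frozen(step: int, total_steps: int, freeze_frac: float = 0.20) -> bool:
    """True while the embedding weights should be excluded from updates."""
    return step < int(total_steps * freeze_frac)


if __name__ == "__main__":
    total = 1000  # illustrative step count, not taken from the card
    for s in (0, 79, 200, 500, 999):
        print(s, f"{lr_at(s, total):.2e}", embeddings_frozen(s, total))
```

In a PyTorch loop, the freeze flag would typically be applied by toggling `requires_grad` on the embedding parameters at the step where it flips.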
Dataset Mix (26 datasets, quality-tiered)
| Tier | Dataset | Samples | Weight | Category |
|---|---|---|---|---|
| 1 | `Open-Orca/OpenOrca` | 40 k | 3.0× | GPT-4 reasoning |
| 1 | `meta-math/MetaMathQA` | 30 k | 2.8× | Math augmentation |
| 1 | `Roman1111111/claude-opus-4.6-10000x` | 10 k | 2.5× | Claude conversations |
| 1 | `WizardLM/WizardLM_evol_instruct_V2_196k` | 25 k | 2.5× | Evolved instruction |
| 1 | `WithinUsAI/GPT5.5_thinking_max_distill_god_seed_25K` | 25 k | 2.5× | Reasoning traces |
| 2 | `microsoft/orca-math-word-problems-200k` | 20 k | 2.2× | Math word problems |
| 2 | `lighteval/MATH-Hard` | 10 k | 2.2× | Hard math |
| 2 | `HuggingFaceH4/MATH-500` | 500 | 2.2× | Competition math |
| 2 | `garage-bAInd/Open-Platypus` | 25 k | 2.0× | Reasoning instruction |
| 2 | `teknium/OpenHermes-2.5` | 30 k | 2.0× | GPT-4 instruction |
| 3 | `ise-uiuc/Magicoder-OSS-Instruct-75K` | 20 k | 1.8× | Code instruction |
| 3 | `m-a-p/CodeFeedback-Filtered-Instruction` | 15 k | 1.8× | Code + feedback |
| 3 | `iamtarun/python_code_instructions_18k_alpaca` | 8 k | 1.6× | Python code |
| 3 | `nvidia/OpenCodeInstruct` | 20 k | 1.5× | Code instruction |
| 3 | `b-mc2/sql-create-context` | 6 k | 1.4× | SQL generation |
| 4 | `HuggingFaceH4/ultrachat_200k` | 30 k | 1.5× | Multi-turn chat |
| 4 | `databricks/databricks-dolly-15k` | 15 k | 1.2× | Instruction following |
| 4 | `Amod/mental_health_counseling_conversations` | 5 k | 1.0× | Counseling chat |
| 4 | `mlabonne/guanaco-llama2-1k` | 1 k | 1.0× | General QA |
| 5 | `ray0rf1re/FineWeb-Nano` | 20 k | 0.8× | Web text |
| 5 | `ray0rf1re/hyper-pip` | 85 | 3.0× | HyperNix pip data |
| 3 | `flytech/python-codes-25k` | 20 k | 1.7× | Python code solutions |
| 3 | `ByteDance-Seed/Code-Contests-Plus` | 15 k | 1.6× | Competitive coding |
| 1 | `open-thoughts/OpenThoughts-TB-dev` | 20 k | 2.3× | Verified thinking traces |
| 6 | `Nix-ai/cat-math-v1` | 5 k | 0.3× | Cat math (niche) |
| 6 | `Nix-ai/Cat-v2.8XXXL-plus` | 5 k | 0.3× | Cat general (niche) |
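One plausible reading of the Weight column is that each dataset is drawn with probability proportional to samples × weight. The card does not state how weights are applied, so the sketch below is an assumption for illustration only, using four rows copied from the table. The function names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# (samples, weight) for a few rows of the dataset table
MIX: Dict[str, Tuple[int, float]] = {
    "Open-Orca/OpenOrca": (40_000, 3.0),
    "meta-math/MetaMathQA": (30_000, 2.8),
    "ray0rf1re/FineWeb-Nano": (20_000, 0.8),
    "Nix-ai/cat-math-v1": (5_000, 0.3),
}


def sampling_probs(mix: Dict[str, Tuple[int, float]]) -> Dict[str, float]:
    """Probability of each source, proportional to samples * weight (assumed rule)."""
    mass = {name: n * w for name, (n, w) in mix.items()}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}


def draw_sources(mix: Dict[str, Tuple[int, float]], k: int, seed: int = 0) -> List[str]:
    """Draw k source names according to the weighted probabilities."""
    probs = sampling_probs(mix)
    names = list(probs)
    rng = random.Random(seed)
    return rng.choices(names, weights=[probs[n] for n in names], k=k)


if __name__ == "__main__":
    for name, p in sampling_probs(MIX).items():
        print(f"{name:28s} {p:.3f}")
```

Under this rule a tier-1 source such as OpenOrca dominates the toy mix, while the 0.3× niche sets contribute well under 1% of draws.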
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "ray0rf1re/Nano-Nano_v5.1", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("ray0rf1re/Nano-Nano_v5.1")


def chat(prompt: str, max_new_tokens: int = 256) -> str:
    # <think> opens the reasoning block; the model outputs its reasoning,
    # then </think>, then the answer.
    text = ("<|im_start|>user\n" + prompt + "<|im_end|>\n"
            "<|im_start|>assistant\n<think>\n")
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs, max_new_tokens=max_new_tokens,
        do_sample=True, temperature=0.7, top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:],
                            skip_special_tokens=True).strip()


print(chat("Write a Python function to merge two sorted lists."))
print(chat("Solve: if 3x + 7 = 22, what is x?"))
print(chat("Explain transformer attention in simple terms."))
```
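Because the prompt opens a `<think>` block, the returned text contains the reasoning, a closing `</think>`, and then the answer. A small helper can separate the two. This sketch assumes the `</think>` tag survives decoding with `skip_special_tokens=True`, which depends on how the tokenizer registers it, and it treats output without the tag as the answer. The function name is illustrative.

```python
from typing import Tuple


def split_reasoning(generated: str, close_tag: str = "</think>") -> Tuple[str, str]:
    """Split generated text into (reasoning, answer) at the first closing tag."""
    reasoning, sep, answer = generated.partition(close_tag)
    if not sep:
        # No closing tag found: return everything as the answer (a design choice).
        return "", generated.strip()
    return reasoning.strip(), answer.strip()


if __name__ == "__main__":
    sample = "Subtract 7 from both sides, then divide by 3.\n</think>\nx = 5"
    reasoning, answer = split_reasoning(sample)
    print("reasoning:", reasoning)
    print("answer:", answer)
```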
Limitations
- Context limited to 2 048 tokens
- Trained on 0.01 B tokens, far below production scale
- Pascal GPU (GTX 1080): fp16 AMP only, no bf16
- Not RLHF/DPO aligned
Citation
@misc{nano-nano-v5,
author = {ray0rf1re},
title = {Nano-Nano v5.1: 300M LLaMA with Sequence Packing},
year = {2026},
publisher = {HuggingFace},
howpublished = {https://huggingface.co/ray0rf1re/Nano-Nano_v5.1},
}
