A newer version of this model is available: ray0rf1re/Nano-nano-4.6

🧠 Nano-nano v4.5

~255.7 M · LLaMA · Instruction-tuned · From scratch

Successor to Nano-nano v4.
Same architecture family, ~8% larger (~236 M → ~255.7 M parameters), trained from scratch on 17 weighted datasets.

📋 Quick Facts


Architecture	LLaMA (decoder-only)
Parameters	~255.7 M
Context length	2 048 tokens
Vocabulary	50,264 tokens
Training loss	`5.1763`
Eval score	`16.7%`
Trained on	0.08 B tokens
Hardware	NVIDIA GTX 1080 8 GB (Pascal)
Trained	2026-05-09 22:50

🏗️ Architecture

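Standard LLaMA decoder-only transformer. Compared with v4, the MLP is ~8% wider (`intermediate_size` 2 688 → 2 912), one layer is added, and the context window doubles to 2 048 tokens; `hidden_size` is unchanged.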

Hyperparameter	v4	v4.5
Parameters	~236 M	~255.7 M
`hidden_size`	896	896
`intermediate_size`	2 688	2 912
`num_hidden_layers`	14	15
`num_attention_heads`	14	14
`num_key_value_heads`	14	14
`head_dim`	64	64
`vocab_size`	50 264	50 264
`max_position_embeddings`	1 024	2 048
`rms_norm_eps`	1e-6	1e-6
`rope_theta`	10 000	10 000
`hidden_act`	SiLU	SiLU
`tie_word_embeddings`	False	False
`attention_bias`	False	False
`mlp_bias`	False	False
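
The parameter totals in the table can be reproduced from the listed hyperparameters. The sketch below assumes the usual LLaMA layout (bias-free q/k/v/o projections, a three-matrix gated MLP, two RMSNorm weights per layer, one final norm, untied embeddings); `LlamaShape` and `count_params` are illustrative names, not part of this repository.

```python
from dataclasses import dataclass


@dataclass
class LlamaShape:
    """Hyperparameters copied from the architecture table (hypothetical container)."""
    vocab_size: int
    hidden_size: int
    intermediate_size: int
    num_hidden_layers: int
    num_attention_heads: int
    num_key_value_heads: int
    head_dim: int
    tie_word_embeddings: bool = False


def count_params(s: LlamaShape) -> int:
    """Count weights for a bias-free LLaMA-style decoder (assumed layout)."""
    embed = s.vocab_size * s.hidden_size
    lm_head = 0 if s.tie_word_embeddings else embed
    q_proj = s.hidden_size * s.num_attention_heads * s.head_dim
    kv_proj = 2 * s.hidden_size * s.num_key_value_heads * s.head_dim
    o_proj = s.num_attention_heads * s.head_dim * s.hidden_size
    mlp = 3 * s.hidden_size * s.intermediate_size  # gate, up, down
    norms = 2 * s.hidden_size                      # two RMSNorm weights per layer
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    final_norm = s.hidden_size
    return embed + lm_head + s.num_hidden_layers * per_layer + final_norm


V4 = LlamaShape(50264, 896, 2688, 14, 14, 14, 64)
V45 = LlamaShape(50264, 896, 2912, 15, 14, 14, 64)

for name, shape in (("v4", V4), ("v4.5", V45)):
    print(f"{name}: {count_params(shape) / 1e6:.1f} M parameters")
print(f"growth: {count_params(V45) / count_params(V4) - 1:.1%}")
```

Under these assumptions the counts land on the ~236 M and ~255.7 M figures in the table, with the untied input and output embeddings accounting for roughly 90 M of the total.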

📊 Evaluation

Automatically evaluated after training across 5 capability dimensions.

Category	Hits	Score
Knowledge	0/5	🔴 0%
Reasoning	0/4	🔴 0%
Hallucination	0/4	🔴 0%
Instruction	2/4	🟡 50%
Coherence	1/3	🔴 33%
Overall	—	🔴 17%

Hallucination resistance measures whether the model appropriately declines questions about future events, fictional entities, or impossible premises rather than confabulating.
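
The Overall figure appears to be the unweighted mean of the five category scores rather than the pooled hit rate; that reading is an assumption, inferred only from the numbers in the table. A minimal sketch comparing the two (function names are illustrative):

```python
# (hits, total) per category, copied from the evaluation table.
HITS = {
    "Knowledge": (0, 5),
    "Reasoning": (0, 4),
    "Hallucination": (0, 4),
    "Instruction": (2, 4),
    "Coherence": (1, 3),
}


def macro_average(hits: dict) -> float:
    """Mean of per-category accuracies: every category counts equally."""
    scores = [h / n for h, n in hits.values()]
    return sum(scores) / len(scores)


def pooled_accuracy(hits: dict) -> float:
    """Total hits over total prompts: every prompt counts equally."""
    total_hits = sum(h for h, _ in hits.values())
    total_prompts = sum(n for _, n in hits.values())
    return total_hits / total_prompts


print(f"macro average:   {macro_average(HITS):.1%}")
print(f"pooled accuracy: {pooled_accuracy(HITS):.1%}")
```

With only 20 prompts in total, a single extra hit moves either number by several points, so the scores are best read as a smoke test.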

Figures (not shown): Category Scores · Hallucination · Training Curves

🍳 Training

Setting	Value
Hardware	GTX 1080 8 GB · Pascal · compute capability 6.1
Precision	fp32 weights / fp16 AMP (GradScaler)
Optimizer	StovetopCooker (HyperNix, pre-Volta)
LR	`0.0001` cosine decay
Warmup	6% of steps
Embedding freeze	First 15% of steps
Effective batch	8 × 2048 = 16,384 tokens/step
Steps	5092
Total tokens	0.08 B
Grad clipping	1.0
Grad checkpointing	✅
Peak VRAM	5.34 GB
HyperNix	✅ `freezer` · `StovetopCooker` · `old_fridge` · `new_fridge` · `smoke_alarm` · `pans` · `smoker`
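
The schedule and token budget in the table can be sketched in a few lines. This is not the HyperNix or `StovetopCooker` implementation: linear warmup, decay to a minimum LR of zero, and rounding step counts down are all assumptions, and `lr_at` and `embeddings_frozen` are hypothetical helpers.

```python
import math

TOTAL_STEPS = 5092
PEAK_LR = 1e-4
WARMUP_FRAC = 0.06          # "Warmup 6% of steps"
FREEZE_FRAC = 0.15          # "Embedding freeze: first 15% of steps"
TOKENS_PER_STEP = 8 * 2048  # effective batch x sequence length


def lr_at(step: int, min_lr: float = 0.0) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to min_lr (assumed shape)."""
    warmup_steps = int(TOTAL_STEPS * WARMUP_FRAC)
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


def embeddings_frozen(step: int) -> bool:
    """True while the embedding matrix would still be excluded from updates."""
    return step < int(TOTAL_STEPS * FREEZE_FRAC)


total_tokens = TOKENS_PER_STEP * TOTAL_STEPS
print(f"total tokens: {total_tokens:,} (~{total_tokens / 1e9:.2f} B)")
for step in (0, 304, 763, 2546, 5091):
    print(f"step {step:>4}: lr={lr_at(step):.2e} frozen={embeddings_frozen(step)}")
```

The product of steps and tokens per step agrees with the 0.08 B total reported in the table.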

Dataset Mix

Dataset	Samples	Weight	Category
`Roman1111111/claude-opus-4.6-10000x`	10 k	2.5×	Claude conversations
`WithinUsAI/GPT5.5_thinking_max_distill_god_seed_25K`	25 k	2.0×	Reasoning / thinking
`HuggingFaceH4/MATH-500`	500	2.0×	Competition math
`lighteval/MATH-Hard`	10 k	2.0×	Hard math
`garage-bAInd/Open-Platypus`	25 k	1.8×	Reasoning instruction
`iamtarun/python_code_instructions_18k_alpaca`	8 k	1.6×	Python code
`b-mc2/sql-create-context`	6 k	1.4×	SQL code
`nvidia/OpenCodeInstruct`	30 k	1.5×	Code instruction
`teknium/OpenHermes-2.5`	30 k	1.5×	General instruction
`Amod/mental_health_counseling_conversations`	5 k	1.2×	Chat / counseling
`ray0rf1re/FineWeb-Nano`	50 k	1.0×	Web text
`tonytins/chat-dataset`	10 k	1.0×	Conversation
`databricks/databricks-dolly-15k`	15 k	1.0×	Instruction following
`mlabonne/guanaco-llama2-1k`	1 k	1.0×	General QA
`ray0rf1re/hyper-pip`	20 k	2.0×	HyperNix pip data
`HuggingFaceH4/ultrachat_200k`	30 k	1.5×	Multi-turn chat
`fka/awesome-chatgpt-prompts`	5 k	0.8×	Prompt engineering
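
To see what the weights do to the mix, the sketch below treats each dataset's sampling mass as `samples × weight`. That interpretation is an assumption; the card does not say how the training pipeline applies the weights. `effective_shares` is an illustrative helper.

```python
# (dataset, samples, weight) copied from the Dataset Mix table.
MIX = [
    ("Roman1111111/claude-opus-4.6-10000x", 10_000, 2.5),
    ("WithinUsAI/GPT5.5_thinking_max_distill_god_seed_25K", 25_000, 2.0),
    ("HuggingFaceH4/MATH-500", 500, 2.0),
    ("lighteval/MATH-Hard", 10_000, 2.0),
    ("garage-bAInd/Open-Platypus", 25_000, 1.8),
    ("iamtarun/python_code_instructions_18k_alpaca", 8_000, 1.6),
    ("b-mc2/sql-create-context", 6_000, 1.4),
    ("nvidia/OpenCodeInstruct", 30_000, 1.5),
    ("teknium/OpenHermes-2.5", 30_000, 1.5),
    ("Amod/mental_health_counseling_conversations", 5_000, 1.2),
    ("ray0rf1re/FineWeb-Nano", 50_000, 1.0),
    ("tonytins/chat-dataset", 10_000, 1.0),
    ("databricks/databricks-dolly-15k", 15_000, 1.0),
    ("mlabonne/guanaco-llama2-1k", 1_000, 1.0),
    ("ray0rf1re/hyper-pip", 20_000, 2.0),
    ("HuggingFaceH4/ultrachat_200k", 30_000, 1.5),
    ("fka/awesome-chatgpt-prompts", 5_000, 0.8),
]


def effective_shares(mix):
    """Return each dataset's share of the mix, assuming mass = samples * weight."""
    mass = {name: samples * weight for name, samples, weight in mix}
    total = sum(mass.values())
    return {name: m / total for name, m in mass.items()}


shares = effective_shares(MIX)
print(f"{len(MIX)} datasets, {sum(s for _, s, _ in MIX):,} raw samples")
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{share:6.1%}  {name}")
```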

🚀 Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" requires the `accelerate` package.
model = AutoModelForCausalLM.from_pretrained(
    "ray0rf1re/Nano-nano_v4.5",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ray0rf1re/Nano-nano_v4.5")

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    # Wrap the prompt in the instruction/response template.
    text = f"### Instruction:\n{prompt}\n\n### Response:\n"
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Drop the prompt tokens and decode only the newly generated ones.
    new_ids = out[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_ids, skip_special_tokens=True).strip()

# Examples
print(generate("Write a Python function to reverse a linked list."))
print(generate("What is the capital of France?"))
print(generate("Explain gradient descent in simple terms."))

⚠️ Limitations

Context limited to 2 048 tokens: unsuitable for long documents
Trained on 0.08 B tokens: far fewer than production models
May hallucinate on obscure or out-of-distribution queries
Not RLHF/DPO aligned — outputs may vary in safety and tone
Pascal GPU constraint (GTX 1080): fp32/fp16 only, no bf16
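
Because prompt and generated tokens share one window, long inputs need trimming before generation. The sketch below assumes the `max_position_embeddings` of 2 048 from the architecture table and the 256-token default of `generate`; `fit_prompt_ids` is a hypothetical helper that operates on a plain list of token ids.

```python
def fit_prompt_ids(input_ids, context_len=2048, max_new_tokens=256):
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens must be smaller than context_len")
    # Negative slicing keeps the tail; shorter prompts are returned unchanged.
    return list(input_ids[-budget:])


long_prompt = list(range(3000))  # stand-in for tokenizer output
trimmed = fit_prompt_ids(long_prompt)
print(len(long_prompt), "->", len(trimmed))
```

Keeping the tail preserves the text nearest the response marker but can cut off the `### Instruction:` header of a very long prompt, so shortening the prompt text itself is often the safer option.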

📜 Citation

@misc{nano-nano-v45,
 author = {ray0rf1re},
 title = {Nano-nano v4.5: Compact LLaMA-Family Causal LM},
 year = {2026},
 publisher = {HuggingFace},
 howpublished = {https://huggingface.co/ray0rf1re/Nano-nano_v4.5},
}


