Chinchilla-300M-FFN-Swiglu

Final validation loss: 2.664274 (next-token cross-entropy, nats)

This release is part of a 300M-parameter Chinchilla-optimal comparison of dense SwiGLU vs ConvSwiGLU FFN variants on English Wikipedia Markdown pretraining.

Both models share the same architecture and training setup except for the FFN block:

Decoder-only transformer, GQA attention, RMSNorm, RoPE
Token Superposition Training (TST) (bag size s=6, 30% of steps)
Multi-Token Prediction (MTP) depth 1 (auxiliary head, weight 0.1)
Aurora optimizer on 2D matrix weights + AdamW on embeddings/norms
WSD learning-rate schedule, NVFP4 training precision (Transformer Engine)
~6B Chinchilla-optimal tokens per variant (20 tokens / parameter)

Validation loss is next-token cross-entropy in nats on a held-out Wikipedia Markdown shard (en-00014), tokenized with the 16k BPE vocabulary. Lower is better. The reported validation loss uses the main LM head only; MTP auxiliary heads are not used.
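Because the loss is reported in nats, it converts directly to perplexity (the exponential of the loss) and to bits per token (the loss divided by ln 2). A minimal sketch of that conversion, using the final validation losses from the results table; the helper names are illustrative and not part of the release:

```python
import math

def perplexity(loss_nats: float) -> float:
    """Per-token perplexity for a cross-entropy loss measured in nats."""
    return math.exp(loss_nats)

def bits_per_token(loss_nats: float) -> float:
    """Convert a loss in nats to bits per token."""
    return loss_nats / math.log(2)

# Final validation losses reported in this card.
final_val = {"SwiGLU": 2.664, "ConvSwiGLU": 2.515}

for name, loss in final_val.items():
    print(f"{name}: ppl={perplexity(loss):.2f}, bits/token={bits_per_token(loss):.3f}")
```

These values are per token of the 16k BPE vocabulary, so they are not directly comparable to models that use a different tokenizer.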

Experiment Results

Model	Params	Phase 1 val (TST end)	Final val	Δ vs SwiGLU
SwiGLU baseline	299,257,728	4.768	2.664	—
ConvSwiGLU	299,423,104	4.774	2.515	−0.149

ConvSwiGLU reaches a 5.6% lower final validation loss (2.515 vs 2.664 nats) with only 165,376 more parameters (about 0.06%). The two models were essentially tied at the end of Phase 1 (TST). ConvSwiGLU lagged early in recovery, was ahead of SwiGLU by the step-28,000 checkpoint, and finished stronger.
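The headline figures follow directly from the results table. A short check of the arithmetic, using only numbers reported in this card (the variable names are illustrative):

```python
# Values copied from the results table.
swiglu = {"params": 299_257_728, "final_val": 2.664}
conv = {"params": 299_423_104, "final_val": 2.515}

# Absolute and relative change in final validation loss.
delta = conv["final_val"] - swiglu["final_val"]
rel_loss = delta / swiglu["final_val"]

# Parameter overhead of the ConvSwiGLU variant.
extra_params = conv["params"] - swiglu["params"]
rel_params = extra_params / swiglu["params"]

print(f"delta val loss: {delta:+.3f} nats ({rel_loss:+.1%})")
print(f"extra parameters: {extra_params:,} ({rel_params:+.3%})")
```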

Phase 2 validation checkpoints (same global step)

Step	SwiGLU val	ConvSwiGLU val
12,000	2.582	2.671
16,000	2.530	2.810
24,000	2.648	2.722
28,000	2.711	2.682
32,000	2.832	2.648
34,000	2.728	2.553
36,000	2.668	2.516
Final	2.664	2.515
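The crossover can be read off the checkpoint table programmatically. The sketch below defines the crossover as the first listed step from which ConvSwiGLU stays ahead at every later checkpoint; that definition is a choice made for this example, not one stated by the authors:

```python
# (step, SwiGLU val, ConvSwiGLU val) copied from the checkpoint table.
checkpoints = [
    (12_000, 2.582, 2.671),
    (16_000, 2.530, 2.810),
    (24_000, 2.648, 2.722),
    (28_000, 2.711, 2.682),
    (32_000, 2.832, 2.648),
    (34_000, 2.728, 2.553),
    (36_000, 2.668, 2.516),
]

def first_lasting_lead(rows):
    """First step from which ConvSwiGLU's loss stays below SwiGLU's, or None."""
    for i, (step, _, _) in enumerate(rows):
        if all(conv < base for _, base, conv in rows[i:]):
            return step
    return None

crossover = first_lasting_lead(checkpoints)

# Signed gap per checkpoint: negative means ConvSwiGLU is ahead.
gaps = {step: round(conv - base, 3) for step, base, conv in checkpoints}
print(crossover, gaps)
```

Checkpoints are sparse, so this only bounds the crossover to somewhere between the step-24,000 and step-28,000 evaluations.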

WandB training logs: urm-conv-swiglu-300m

About This Model

This model is the SwiGLU baseline in the 300M Chinchilla-optimal comparison. It uses a standard gated-MLP FFN with a fused gate+up projection and serves as the control for evaluating whether a depthwise causal convolution in the FFN (ConvSwiGLU) improves language modeling at this scale.
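For readers unfamiliar with the fused form, here is a minimal PyTorch sketch of a SwiGLU FFN in which one linear layer produces both the gate and up activations. It is not the released implementation: the class name, the bias-free layers, and the gate-then-up split order are assumptions of this example. The default sizes come from the architecture table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedSwiGLU(nn.Module):
    """Gated MLP: down(silu(gate) * up), with gate and up computed in one matmul."""

    def __init__(self, d_model: int = 896, d_ff: int = 4352):
        super().__init__()
        # One projection yields both gate and up activations (assumed bias-free).
        self.gate_up = nn.Linear(d_model, 2 * d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split order (gate first, then up) is an assumption of this sketch.
        gate, up = self.gate_up(x).chunk(2, dim=-1)
        return self.down(F.silu(gate) * up)

ffn = FusedSwiGLU()
x = torch.randn(2, 16, 896)  # (batch, sequence, d_model)
y = ffn(x)
n_params = sum(p.numel() for p in ffn.parameters())
print(y.shape, n_params)
```

Fusing gate and up does not change the parameter count relative to two separate projections; it only replaces two matrix multiplies with one larger one.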

Architecture

Parameter	Value
Vocabulary size	16,000
Hidden dimension (d_model)	896
Layers	18
Attention	GQA — 14 query heads, 14 KV heads
Head dimension	64
FFN hidden dimension	4,352
FFN variant	SwiGLU (fused gate+up)
Max sequence length	2,048
RoPE θ	10,000
Normalization	RMSNorm (pre-attention, pre-FFN)
Tied embeddings	Yes
MTP depth	1 (training only; not used at inference)
Total parameters	299,257,728
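The reported total can be reproduced from the table under a few assumptions that this card does not state: all linear layers are bias-free, each RMSNorm has one weight vector, a final norm precedes the LM head, and the MTP module follows the DeepSeek-V3 layout (two input norms, a 2d-to-d projection, and one full transformer block). The function below is a sketch under those assumptions, not the authors' accounting:

```python
def count_params(vocab=16_000, d_model=896, n_layers=18, n_heads=14,
                 n_kv_heads=14, head_dim=64, d_ff=4_352, mtp_depth=1):
    """Return (backbone, mtp) parameter counts under the stated assumptions."""
    q_dim = n_heads * head_dim
    kv_dim = n_kv_heads * head_dim
    # Q and output projections plus K and V projections, all bias-free.
    attn = 2 * d_model * q_dim + 2 * d_model * kv_dim
    # SwiGLU: fused gate+up (d_model x 2*d_ff) plus down (d_ff x d_model).
    ffn = 3 * d_model * d_ff
    # Two RMSNorm weight vectors per block (pre-attention, pre-FFN).
    block = attn + ffn + 2 * d_model
    embedding = vocab * d_model  # tied with the LM head, so counted once
    final_norm = d_model         # assumed final RMSNorm
    backbone = embedding + n_layers * block + final_norm
    # Assumed MTP module: two input norms, a 2d -> d projection, one block.
    mtp = mtp_depth * (2 * d_model + 2 * d_model * d_model + block)
    return backbone, mtp

backbone, mtp = count_params()
print(f"backbone={backbone:,} mtp={mtp:,} total={backbone + mtp:,}")
```

Under this decomposition the sum equals the reported 299,257,728, with roughly 16.5M of those parameters in the MTP module that is unused at inference. With 14 query heads and 14 KV heads, the K and V projections have the same shape as in standard multi-head attention.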

Training

Setting	Value
Total optimizer steps	36,530
Chinchilla token budget	5,985,154,560 (~20 tok/param)
Batch size	2 sequences
Gradient accumulation	16
Tokens per optimizer step (recovery)	65,536
Sequence length (recovery)	2,048
Precision	NVFP4 (Transformer Engine linear swap)
LR schedule	WSD — 1096 warmup, stable, 3653-step decay to 0
AdamW lr (embeddings / norms / conv)	3e-4
Aurora lr (2D matrix weights)	0.02
Weight decay	0.1
Gradient clip	1.0
Seed	0
Hardware	NVIDIA RTX 5090 (32 GB), `urm-dev:26.04` Docker
Eval every	2,000 steps
Checkpoint every	5,000 steps
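The WSD (warmup, stable, decay) schedule in the table can be sketched as a multiplier on each optimizer's peak learning rate. The card gives the step counts but not the ramp shapes, so the linear warmup and linear decay here are assumptions, as is applying the same multiplier to both AdamW and Aurora:

```python
def wsd_factor(step: int, total_steps: int = 36_530,
               warmup_steps: int = 1_096, decay_steps: int = 3_653) -> float:
    """LR multiplier in [0, 1] at a 0-indexed optimizer step (linear ramps assumed)."""
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return (step + 1) / warmup_steps       # linear warmup up to 1.0
    if step < decay_start:
        return 1.0                             # stable plateau
    remaining = total_steps - step
    return max(0.0, remaining / decay_steps)   # linear decay toward 0

# Peak learning rates from the training table.
peak_lrs = {"adamw": 3e-4, "aurora": 0.02}

for step in (0, 1_095, 20_000, 32_877, 36_529, 36_530):
    factor = wsd_factor(step)
    print(step, {name: lr * factor for name, lr in peak_lrs.items()})
```

The warmup and decay lengths correspond to about 3% and 10% of the 36,530 total steps.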

Optimizer routing: Aurora on 2D hidden matrices; AdamW on embeddings, norm weights, biases, and Conv1d depthwise weights (3D tensors).
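The routing rule above can be expressed as a function of a parameter's name and shape. This is an illustrative sketch only: the parameter names, the "embed" substring check, and the depthwise kernel size of 2 are hypothetical and not taken from the training code.

```python
def route(name: str, shape: tuple) -> str:
    """Pick an optimizer group for one parameter from its name and shape."""
    is_embedding = "embed" in name  # hypothetical naming convention
    if len(shape) == 2 and not is_embedding:
        return "aurora"             # 2D hidden matrices
    return "adamw"                  # embeddings, norms, biases, 3D conv weights

# Hypothetical parameter names; shapes follow the architecture table.
params = {
    "embed.weight": (16_000, 896),
    "layers.0.attn.q_proj.weight": (896, 896),
    "layers.0.ffn.gate_up.weight": (8_704, 896),
    "layers.0.ffn.down.weight": (896, 4_352),
    "layers.0.attn_norm.weight": (896,),
    "layers.0.ffn.conv.weight": (4_352, 1, 2),  # ConvSwiGLU only; kernel size assumed
}

groups = {name: route(name, shape) for name, shape in params.items()}
for name, group in groups.items():
    print(f"{group:7s} {name}")
```

The embedding matrix is 2D, so a shape test alone would send it to Aurora; some name-based or module-based exclusion is needed to match the routing described.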

Token Superposition Training (TST)

Phase	Steps	Seq len (model)	Objective
Phase 1 — superposition	10,959	2,048 compressed (from 12,288 raw)	Multi-hot CE over 6-token bags
Phase 2 — recovery	25,571	2,048	Standard next-token CE

TST uses bag size s=6 and a step ratio of 0.30. Optimizer state and the LR schedule carry over from phase 1 to phase 2 (unified schedule). torch.compile was disabled for checkpoint compatibility.
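The phase split and token accounting can be checked from the reported settings. The sketch assumes that each phase 1 position covers one bag of 6 consecutive raw tokens, as suggested by "2,048 compressed (from 12,288 raw)", and that bags do not overlap; the helper `to_bags` is hypothetical.

```python
TOTAL_STEPS = 36_530
TST_RATIO = 0.30
BAG_SIZE = 6
SEQ_LEN = 2_048
SEQS_PER_STEP = 2 * 16  # batch size x gradient accumulation

phase1_steps = round(TOTAL_STEPS * TST_RATIO)
phase2_steps = TOTAL_STEPS - phase1_steps
positions_per_step = SEQS_PER_STEP * SEQ_LEN

# Raw tokens consumed: phase 1 positions each cover BAG_SIZE raw tokens (assumed).
raw_tokens = positions_per_step * (phase1_steps * BAG_SIZE + phase2_steps)

def to_bags(tokens, bag_size=BAG_SIZE):
    """Group a flat token list into consecutive, non-overlapping bags."""
    return [tokens[i:i + bag_size] for i in range(0, len(tokens), bag_size)]

bags = to_bags(list(range(SEQ_LEN * BAG_SIZE)))  # 12,288 raw tokens
print(phase1_steps, phase2_steps, positions_per_step, f"{raw_tokens:,}", len(bags))
```

Under these assumptions the run consumes 5,985,075,200 raw tokens, which is 79,360 short of the stated 5,985,154,560 budget.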

Multi-Token Prediction

MTP uses depth 1 with loss weight 0.1. During TST phase 1, the MTP head predicts bag-shifted targets; during recovery, it performs standard +1 token prediction. MTP weights are saved in the checkpoint but ignored at inference, which uses the main LM head only.
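A minimal sketch of how a loss weight of 0.1 typically enters the objective during recovery, assuming the auxiliary loss is simply added to the main loss (the card gives the weight but not the exact formula). The function names are illustrative, and the logits are plain Python lists for a single position:

```python
import math

def cross_entropy(logits, target: int) -> float:
    """Cross-entropy in nats for one position, via a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

MTP_WEIGHT = 0.1  # loss weight from this card

def training_loss(main_logits, main_target, mtp_logits, mtp_target) -> float:
    """Main next-token loss plus the down-weighted MTP auxiliary loss (assumed additive)."""
    main = cross_entropy(main_logits, main_target)
    aux = cross_entropy(mtp_logits, mtp_target)
    return main + MTP_WEIGHT * aux

def eval_loss(main_logits, main_target) -> float:
    """Validation loss as reported here: main LM head only."""
    return cross_entropy(main_logits, main_target)

uniform = [0.0, 0.0, 0.0, 0.0]
print(training_loss(uniform, 0, uniform, 1), eval_loss(uniform, 0))
```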

Dataset

Trained on open-index/open-wikipedia-markdown.

Split	Shards
Train	`en-00001` … `en-00004` (~6B tokens, `chinchilla_train.bin`)
Validation	`en-00014` (`data/tokens/val.bin`)

Tokenized with the project 16,000-token BPE vocabulary (data/tokenizer.json). Legacy shard en-00000 was excluded from training to avoid overlap with prior experiments.

Usage

Each repo includes modeling_chinchilla_300m.py, a self-contained inference script that supports both the SwiGLU and ConvSwiGLU checkpoints.

Install dependencies:

pip install torch safetensors tokenizers

Generate from the command line:

python modeling_chinchilla_300m.py --repo_dir . --prompt "The theory of" --max_new_tokens 200

Generate from Python:

from modeling_chinchilla_300m import load_model, generate

model, tokenizer = load_model(".", device="cuda")
print(generate(model, tokenizer, "In quantum mechanics,", max_new_tokens=200, temperature=1.0, top_k=50))

Limitations

Research-scale 300M model; not a production system.
Trained on English Wikipedia Markdown only.
Custom architecture — use the bundled modeling_chinchilla_300m.py for inference.
MTP auxiliary heads are present in weights but unused at inference.

References

Gouge, H. Training codebase and infrastructure.
Peng, B., Gigant, E., Quesnelle, J. Token Superposition Training. arXiv:2605.06546.
DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv:2412.19437.
Tilde Research. Aurora optimizer. Blog post.
Universal Reasoning Model (ConvSwiGLU module). arXiv:2512.14693.


Checkpoint format: Safetensors, 0.3B params, BF16 tensors.


URL: https://huggingface.co/hudsongouge/Chinchilla-300M-FFN-Swiglu
