Chinchilla-300M-FFN-Swiglu
Final validation loss: 2.664274 (next-token cross-entropy, nats)
This release is part of a 300M-parameter Chinchilla-optimal comparison of dense SwiGLU vs ConvSwiGLU FFN variants on English Wikipedia Markdown pretraining.
Both models share identical training hyperparameters except for the FFN block:
- Decoder-only transformer, GQA attention, RMSNorm, RoPE
- Token Superposition Training (TST) (bag size s=6, 30% of steps)
- Multi-Token Prediction (MTP) depth 1 (auxiliary head, weight 0.1)
- Aurora optimizer on 2D matrix weights + AdamW on embeddings/norms
- WSD learning-rate schedule, NVFP4 training precision (Transformer Engine)
- ~6B Chinchilla-optimal tokens per variant (20 tokens / parameter)
Validation loss is next-token cross-entropy in nats on a held-out Wikipedia Markdown shard
(en-00014), 16k BPE vocabulary. Lower is better. MTP auxiliary heads are not used for
the reported val loss (main LM head only).
Experiment Results
| Model | Params | Phase 1 val (TST end) | Final val | Δ vs SwiGLU |
|---|---|---|---|---|
| SwiGLU baseline | 299,257,728 | 4.768 | 2.664 | — |
| ConvSwiGLU | 299,423,104 | 4.774 | 2.515 | −0.149 |
ConvSwiGLU reaches a 5.6% lower final validation loss (2.515 vs 2.664 nats) at a nearly identical parameter count. The two models were effectively tied at the end of Phase 1 (TST). ConvSwiGLU lagged early in recovery, overtook SwiGLU around step 28k, and finished stronger.
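The headline figures can be re-derived from the results table. The sketch below recomputes the absolute and relative gap and, as a convenience that the source does not report, converts each loss from nats to perplexity via exp(loss).

```python
import math

# Final validation losses in nats, copied from the results table.
swiglu_loss = 2.664
conv_swiglu_loss = 2.515

delta = conv_swiglu_loss - swiglu_loss
relative_reduction = -delta / swiglu_loss

# Perplexity is exp(cross-entropy in nats). These values are derived here,
# not reported by the authors.
swiglu_ppl = math.exp(swiglu_loss)
conv_swiglu_ppl = math.exp(conv_swiglu_loss)

print(f"delta = {delta:+.3f} nats")
print(f"relative reduction = {relative_reduction:.1%}")
print(f"perplexity: SwiGLU {swiglu_ppl:.2f}, ConvSwiGLU {conv_swiglu_ppl:.2f}")
```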
Phase 2 validation checkpoints (same global step)
| Step | SwiGLU val | ConvSwiGLU val |
|---|---|---|
| 12,000 | 2.582 | 2.671 |
| 16,000 | 2.530 | 2.810 |
| 24,000 | 2.648 | 2.722 |
| 28,000 | 2.711 | 2.682 |
| 32,000 | 2.832 | 2.648 |
| 34,000 | 2.728 | 2.553 |
| 36,000 | 2.668 | 2.516 |
| Final | 2.664 | 2.515 |
WandB training logs: urm-conv-swiglu-300m
About This Model
This model is the SwiGLU baseline in the 300M Chinchilla-optimal comparison: a standard gated MLP FFN with a fused gate+up projection. It serves as the control for evaluating whether a depthwise causal convolution in the FFN (ConvSwiGLU) improves language modeling at this scale.
Architecture
| Parameter | Value |
|---|---|
| Vocabulary size | 16,000 |
| Hidden dimension (d_model) | 896 |
| Layers | 18 |
| Attention | GQA, 14 query heads, 14 KV heads (equal counts, so equivalent to standard multi-head attention) |
| Head dimension | 64 |
| FFN hidden dimension | 4,352 |
| FFN variant | SwiGLU (fused gate+up) |
| Max sequence length | 2,048 |
| RoPE θ | 10,000 |
| Normalization | RMSNorm (pre-attention, pre-FFN) |
| Tied embeddings | Yes |
| MTP depth | 1 (training only; not used at inference) |
| Total parameters | 299,257,728 |
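As a consistency check, the reported total can be reproduced from the table values under a few assumptions that the source does not state: no bias terms, one weight vector per RMSNorm, a final RMSNorm, and a DeepSeek-V3-style MTP module (two input norms, a 2·d_model → d_model projection, and one extra transformer block sharing the embedding and head). A minimal sketch:

```python
# Values from the architecture table.
vocab, d_model, n_layers = 16_000, 896, 18
n_q_heads, n_kv_heads, head_dim = 14, 14, 64
d_ffn = 4_352

# Assumptions (not stated in the source): no biases, a final RMSNorm,
# and a DeepSeek-V3-style MTP module.
embedding = vocab * d_model  # tied with the LM head, so counted once

attn = d_model * (n_q_heads * head_dim)        # Q projection
attn += 2 * d_model * (n_kv_heads * head_dim)  # K and V projections
attn += (n_q_heads * head_dim) * d_model       # output projection

ffn = d_model * (2 * d_ffn) + d_ffn * d_model  # fused gate+up, then down
block = attn + ffn + 2 * d_model               # plus two RMSNorm weights

backbone = embedding + n_layers * block + d_model  # plus final RMSNorm

# Assumed MTP module: two input RMSNorms, a 2*d_model -> d_model projection,
# and one extra transformer block.
mtp = 2 * d_model + (2 * d_model) * d_model + block

total = backbone + mtp
print(f"{total:,}")
```

Under these assumptions the sum equals 299,257,728. Other decompositions could give the same number, so this is a plausibility check, not a description of the actual module layout.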
Training
| Setting | Value |
|---|---|
| Total optimizer steps | 36,530 |
| Chinchilla token budget | 5,985,154,560 (~20 tok/param) |
| Batch size | 2 sequences |
| Gradient accumulation | 16 |
| Tokens per optimizer step (recovery) | 65,536 |
| Sequence length (recovery) | 2,048 |
| Precision | NVFP4 (Transformer Engine linear swap) |
| LR schedule | WSD — 1096 warmup, stable, 3653-step decay to 0 |
| AdamW lr (embeddings / norms / conv) | 3e-4 |
| Aurora lr (2D matrix weights) | 0.02 |
| Weight decay | 0.1 |
| Gradient clip | 1.0 |
| Seed | 0 |
| Hardware | NVIDIA RTX 5090 (32 GB), urm-dev:26.04 Docker |
| Eval every | 2,000 steps |
| Checkpoint every | 5,000 steps |
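The WSD schedule in the table can be sketched as a multiplier on the peak learning rate. The source gives only the step counts and the decay target of 0, so the linear warmup and linear decay shapes below are assumptions, and `wsd_multiplier` is a hypothetical name.

```python
def wsd_multiplier(step: int,
                   total_steps: int = 36_530,
                   warmup_steps: int = 1_096,
                   decay_steps: int = 3_653) -> float:
    """Warmup-stable-decay multiplier in [0, 1] for a given optimizer step."""
    if step < warmup_steps:
        # Assumed linear warmup from 0 to 1.
        return step / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase at the peak learning rate.
        return 1.0
    # Assumed linear decay to 0 over the final decay_steps.
    remaining = max(total_steps - step, 0)
    return remaining / decay_steps


# Peak learning rates from the table, scaled by the shared multiplier.
for step in (0, 548, 1_096, 20_000, 35_000, 36_530):
    m = wsd_multiplier(step)
    print(step, f"adamw={3e-4 * m:.2e}", f"aurora={0.02 * m:.2e}")
```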
Optimizer routing: Aurora on 2D hidden matrices; AdamW on embeddings, norm weights, biases, and Conv1d depthwise weights (3D tensors).
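The routing rule amounts to a check on tensor rank plus an exception for embeddings, which are 2D but go to AdamW. A minimal sketch follows. The parameter names, the name-based embedding test, and the conv weight shape are hypothetical and not taken from the training code.

```python
def route_parameter(name: str, shape: tuple) -> str:
    """Return "aurora" or "adamw" for one parameter, following the stated rule."""
    is_embedding = "embed" in name  # hypothetical naming convention
    if len(shape) == 2 and not is_embedding:
        return "aurora"  # 2D hidden matrices
    return "adamw"       # embeddings, norms, biases, 3D conv weights


# Hypothetical parameter names. The conv channel count and kernel size
# are illustrative only.
examples = {
    "embed_tokens.weight": (16_000, 896),
    "layers.0.attn.q_proj.weight": (896, 896),
    "layers.0.ffn.gate_up.weight": (8_704, 896),
    "layers.0.attn_norm.weight": (896,),
    "layers.0.ffn.conv.weight": (4_352, 1, 4),
}

for name, shape in examples.items():
    print(f"{name:32s} {route_parameter(name, shape)}")
```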
Token Superposition Training (TST)
| Phase | Steps | Seq len (model) | Objective |
|---|---|---|---|
| Phase 1 — superposition | 10,959 | 2,048 compressed (from 12,288 raw) | Multi-hot CE over 6-token bags |
| Phase 2 — recovery | 25,571 | 2,048 | Standard next-token CE |
TST bag size s=6, step ratio 0.30. Optimizer state and LR schedule carry over from phase 1
to phase 2 (unified schedule). torch.compile disabled for checkpoint compatibility.
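The phase split and token budget can be cross-checked with integer arithmetic. The sketch assumes that phase 1 processes the same 2 × 16 sequences per optimizer step as recovery, which the source states only for the recovery phase.

```python
total_steps = 36_530
tst_step_percent = 30          # step ratio 0.30
bag_size = 6
seq_len = 2_048
sequences_per_step = 2 * 16    # batch size x gradient accumulation

phase1_steps = total_steps * tst_step_percent // 100
phase2_steps = total_steps - phase1_steps

# Assumption: phase 1 uses the same number of sequences per step as recovery,
# each compressed from bag_size * seq_len raw tokens.
raw_len = bag_size * seq_len
phase1_tokens = phase1_steps * sequences_per_step * raw_len
phase2_tokens = phase2_steps * sequences_per_step * seq_len

total_tokens = phase1_tokens + phase2_tokens
budget = 5_985_154_560
print(phase1_steps, phase2_steps, raw_len)
print(f"{total_tokens:,} of {budget:,} budgeted tokens")
```

Under this assumption the two phases consume 5,985,075,200 raw tokens, which is within one phase 1 step (393,216 raw tokens) of the stated budget.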
Multi-Token Prediction
MTP depth 1, loss weight 0.1. During TST phase 1, MTP predicts bag-shifted targets; during recovery, standard +1 token prediction. MTP weights are saved in the checkpoint but ignored at inference (main LM head only).
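The difference between the training objective and the reported validation loss can be sketched as follows, assuming the auxiliary loss is simply added to the main loss with the stated weight. The function names are hypothetical.

```python
MTP_WEIGHT = 0.1  # loss weight from the model card


def training_loss(main_ce: float, mtp_ce: float,
                  mtp_weight: float = MTP_WEIGHT) -> float:
    """Assumed training objective: main CE plus weighted MTP auxiliary CE."""
    return main_ce + mtp_weight * mtp_ce


def reported_val_loss(main_ce: float, mtp_ce: float) -> float:
    """Validation loss as reported: main LM head only, MTP term ignored."""
    return main_ce


print(training_loss(2.5, 3.0), reported_val_loss(2.5, 3.0))
```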
Dataset
Trained on open-index/open-wikipedia-markdown.
| Split | Shards |
|---|---|
| Train | en-00001 … en-00004 (~6B tokens, chinchilla_train.bin) |
| Validation | en-00014 (data/tokens/val.bin) |
Tokenized with the project 16,000-token BPE vocabulary (data/tokenizer.json).
Legacy shard en-00000 was excluded from training to avoid overlap with prior experiments.
Usage
Each repo includes modeling_chinchilla_300m.py — self-contained inference (SwiGLU and ConvSwiGLU).
```bash
pip install torch safetensors tokenizers
python modeling_chinchilla_300m.py --repo_dir . --prompt "The theory of" --max_new_tokens 200
```

```python
from modeling_chinchilla_300m import load_model, generate

model, tokenizer = load_model(".", device="cuda")
print(generate(model, tokenizer, "In quantum mechanics,", max_new_tokens=200, temperature=1.0, top_k=50))
```
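The `temperature` and `top_k` arguments conventionally control temperature-scaled top-k sampling. The source does not show the bundled script's implementation, so the standalone sketch below illustrates only that conventional meaning. `sample_top_k` is a hypothetical helper, not part of `modeling_chinchilla_300m.py`.

```python
import math
import random


def sample_top_k(logits, temperature=1.0, top_k=50, rng=random):
    """Sample a token id from the top_k highest logits after temperature scaling."""
    k = min(top_k, len(logits))
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights over the kept logits, shifted by the max for stability.
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


# With top_k=1 the sampler is greedy and always returns the argmax.
print(sample_top_k([0.1, 5.0, -2.0], top_k=1))
```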
Limitations
- Research-scale 300M model; not a production system.
- Trained on English Wikipedia Markdown only.
- Custom architecture: use the bundled modeling_chinchilla_300m.py for inference.
- MTP auxiliary heads are present in weights but unused at inference.
References
- Gouge, H. Training codebase and infrastructure.
- Peng, B., Gigant, E., Quesnelle, J. Token Superposition Training. arXiv:2605.06546
- DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv:2412.19437
- Tilde Research. Aurora optimizer. blog
- Universal Reasoning Model — ConvSwiGLU module. arXiv:2512.14693
