URL: https://huggingface.co/hudsongouge/Chinchilla-300M-FFN-Swiglu

Chinchilla-300M-FFN-Swiglu

Final validation loss: 2.664274 (next-token cross-entropy, nats)

This release is part of a 300M-parameter Chinchilla-optimal comparison of dense SwiGLU vs ConvSwiGLU FFN variants on English Wikipedia Markdown pretraining.

Both models share the same architecture and training setup; only the FFN block differs:

  • Decoder-only transformer, GQA attention, RMSNorm, RoPE
  • Token Superposition Training (TST) (bag size s=6, 30% of steps)
  • Multi-Token Prediction (MTP) depth 1 (auxiliary head, weight 0.1)
  • Aurora optimizer on 2D matrix weights + AdamW on embeddings/norms
  • WSD learning-rate schedule, NVFP4 training precision (Transformer Engine)
  • ~6B Chinchilla-optimal tokens per variant (20 tokens / parameter)

Validation loss is next-token cross-entropy in nats on a held-out Wikipedia Markdown shard (en-00014), 16k BPE vocabulary. Lower is better. MTP auxiliary heads are not used for the reported val loss (main LM head only).
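As a minimal sketch of what "next-token cross-entropy in nats" means, the snippet below computes the mean negative log-likelihood from raw logits using natural logarithms. The function name `next_token_ce_nats` and the toy inputs are illustrative and are not taken from the training codebase.

```python
import math

def next_token_ce_nats(logits, targets):
    """Mean next-token cross-entropy in nats.

    logits:  list of per-position score lists (one score per vocab entry)
    targets: list of target token ids, one per position
    """
    total = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]  # equals -log softmax(row)[target]
    return total / len(targets)

# A model with no information (uniform logits) over a 16,000-token
# vocabulary scores ln(16000) nats per token.
uniform = next_token_ce_nats([[0.0] * 16000], [123])
print(f"uniform baseline: {uniform:.2f} nats")
```

The uniform baseline of ln(16000), roughly 9.68 nats, gives a reference point for reading the reported losses.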

Experiment Results

| Model | Params | Phase 1 val (TST end) | Final val | Δ vs SwiGLU |
|---|---|---|---|---|
| SwiGLU baseline | 299,257,728 | 4.768 | 2.664 | n/a |
| ConvSwiGLU | 299,423,104 | 4.774 | 2.515 | −0.149 |

ConvSwiGLU reaches a 5.6% lower final validation loss (2.515 vs 2.664 nats) with only 165,376 more parameters (about 0.06%). The two models were effectively tied at the end of Phase 1 (4.768 vs 4.774). ConvSwiGLU lagged early in recovery, overtook SwiGLU around step 28k, and finished stronger.
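The headline figures can be rechecked from the two final losses. This short sketch recomputes the absolute and relative gap and, because the losses are in nats, converts them to perplexity with `exp`. The variable names are illustrative.

```python
import math

swiglu_val = 2.664       # final validation loss, nats
conv_swiglu_val = 2.515

delta = conv_swiglu_val - swiglu_val
relative_pct = 100.0 * -delta / swiglu_val

# Perplexity is exp(loss) when the loss is measured in nats.
ppl_swiglu = math.exp(swiglu_val)
ppl_conv = math.exp(conv_swiglu_val)

print(f"delta = {delta:+.3f} nats, relative improvement = {relative_pct:.1f}%")
print(f"perplexity: {ppl_swiglu:.2f} (SwiGLU) vs {ppl_conv:.2f} (ConvSwiGLU)")
```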

Phase 2 validation checkpoints (same global step)

| Step | SwiGLU val | ConvSwiGLU val |
|---|---|---|
| 12,000 | 2.582 | 2.671 |
| 16,000 | 2.530 | 2.810 |
| 24,000 | 2.648 | 2.722 |
| 28,000 | 2.711 | 2.682 |
| 32,000 | 2.832 | 2.648 |
| 34,000 | 2.728 | 2.553 |
| 36,000 | 2.668 | 2.516 |
| Final | 2.664 | 2.515 |
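To make the crossover concrete, the following sketch scans the listed checkpoints for the first step from which ConvSwiGLU stays ahead. The helper `first_crossover` is a hypothetical name used only for this illustration; the numbers are copied from the checkpoint list.

```python
# (step, SwiGLU val, ConvSwiGLU val) for each listed phase 2 checkpoint.
checkpoints = [
    (12_000, 2.582, 2.671),
    (16_000, 2.530, 2.810),
    (24_000, 2.648, 2.722),
    (28_000, 2.711, 2.682),
    (32_000, 2.832, 2.648),
    (34_000, 2.728, 2.553),
    (36_000, 2.668, 2.516),
]

def first_crossover(rows):
    """Return the first step from which ConvSwiGLU is ahead at every later checkpoint."""
    for i, (step, _, _) in enumerate(rows):
        if all(conv < base for _, base, conv in rows[i:]):
            return step
    return None

crossover_step = first_crossover(checkpoints)

# Positive gap means ConvSwiGLU has the lower (better) loss.
gaps = {step: round(base - conv, 3) for step, base, conv in checkpoints}
print(crossover_step, gaps)
```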

WandB training logs: urm-conv-swiglu-300m

About This Model

This model is the SwiGLU baseline in the 300M Chinchilla-optimal comparison. It uses a standard gated MLP FFN with a fused gate+up projection and serves as the control for evaluating whether a depthwise causal convolution in the FFN (ConvSwiGLU) improves language modeling at this scale.
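For readers unfamiliar with the fused form, here is a minimal NumPy sketch of a SwiGLU FFN in which the gate and up projections share one weight matrix and one matmul. It makes several assumptions the card does not state: no biases, gate columns stored before up columns, and toy dimensions. The function and variable names are illustrative and do not come from modeling_chinchilla_300m.py.

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate_up, w_down):
    """SwiGLU FFN with a fused gate+up projection.

    x:         (seq_len, d_model)
    w_gate_up: (d_model, 2 * d_ffn), gate and up weights side by side
    w_down:    (d_ffn, d_model)
    """
    fused = x @ w_gate_up                    # one matmul instead of two
    gate, up = np.split(fused, 2, axis=-1)   # ASSUMPTION: gate half comes first
    return (silu(gate) * up) @ w_down

rng = np.random.default_rng(0)
d_model, d_ffn, seq_len = 8, 16, 4           # toy sizes; the model uses 896 and 4,352
x = rng.standard_normal((seq_len, d_model))
w_gate_up = rng.standard_normal((d_model, 2 * d_ffn)) * 0.1
w_down = rng.standard_normal((d_ffn, d_model)) * 0.1
y = swiglu_ffn(x, w_gate_up, w_down)
print(y.shape)
```

Fusing the two projections changes only how the weights are stored and multiplied; the result matches two separate gate and up matmuls.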

Architecture

| Parameter | Value |
|---|---|
| Vocabulary size | 16,000 |
| Hidden dimension (d_model) | 896 |
| Layers | 18 |
| Attention | GQA (14 query heads, 14 KV heads) |
| Head dimension | 64 |
| FFN hidden dimension | 4,352 |
| FFN variant | SwiGLU (fused gate+up) |
| Max sequence length | 2,048 |
| RoPE θ | 10,000 |
| Normalization | RMSNorm (pre-attention, pre-FFN) |
| Tied embeddings | Yes |
| MTP depth | 1 (training only; not used at inference) |
| Total parameters | 299,257,728 |
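The card does not break the total down. The sketch below is one decomposition that reproduces 299,257,728 exactly from the listed dimensions. It assumes bias-free linear layers, a final RMSNorm, and an MTP module shaped like the one in DeepSeek-V3 (two RMSNorms, a 2·d_model to d_model projection, and one extra transformer block). The MTP layout in particular is a guess that happens to fit the number, not something the card states.

```python
vocab, d_model, n_layers = 16_000, 896, 18
n_q_heads, n_kv_heads, head_dim = 14, 14, 64
d_ffn = 4_352

embed = vocab * d_model                         # tied with the LM head, counted once
attn = (d_model * n_q_heads * head_dim          # Q projection
        + 2 * d_model * n_kv_heads * head_dim   # K and V projections
        + n_q_heads * head_dim * d_model)       # output projection
ffn = 3 * d_model * d_ffn                       # gate, up, down (ASSUMPTION: no biases)
block = attn + ffn + 2 * d_model                # plus two RMSNorm weight vectors

backbone = embed + n_layers * block + d_model   # plus a final RMSNorm (assumed)

# ASSUMPTION: MTP module = two RMSNorms + a (2*d_model -> d_model)
# projection + one extra transformer block, following DeepSeek-V3.
mtp = 2 * d_model + 2 * d_model * d_model + block

total = backbone + mtp
print(f"backbone={backbone:,} mtp={mtp:,} total={total:,}")
```

If this reading is right, roughly 16.5M of the reported parameters belong to the training-only MTP module.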

Training

| Setting | Value |
|---|---|
| Total optimizer steps | 36,530 |
| Chinchilla token budget | 5,985,154,560 (~20 tok/param) |
| Batch size | 2 sequences |
| Gradient accumulation | 16 |
| Tokens per optimizer step (recovery) | 65,536 |
| Sequence length (recovery) | 2,048 |
| Precision | NVFP4 (Transformer Engine linear swap) |
| LR schedule | WSD: 1,096 warmup steps, stable phase, 3,653-step decay to 0 |
| AdamW lr (embeddings / norms / conv) | 3e-4 |
| Aurora lr (2D matrix weights) | 0.02 |
| Weight decay | 0.1 |
| Gradient clip | 1.0 |
| Seed | 0 |
| Hardware | NVIDIA RTX 5090 (32 GB), urm-dev:26.04 Docker |
| Eval every | 2,000 steps |
| Checkpoint every | 5,000 steps |
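The WSD (warmup, stable, decay) schedule can be sketched as a multiplier applied to each optimizer's base learning rate. The phase lengths come from the settings above; the linear shape of both ramps and the use of one multiplier for both optimizers are assumptions, since the card gives only the step counts and the final value of 0.

```python
TOTAL_STEPS = 36_530
WARMUP_STEPS = 1_096
DECAY_STEPS = 3_653

def wsd_multiplier(step):
    """LR multiplier in [0, 1] for a warmup-stable-decay schedule.

    ASSUMPTION: linear warmup and linear decay.
    """
    decay_start = TOTAL_STEPS - DECAY_STEPS
    if step < WARMUP_STEPS:
        return (step + 1) / WARMUP_STEPS
    if step < decay_start:
        return 1.0
    return max(0.0, (TOTAL_STEPS - step) / DECAY_STEPS)

def adamw_lr(step):
    return 3e-4 * wsd_multiplier(step)   # embeddings / norms / conv

def aurora_lr(step):
    return 0.02 * wsd_multiplier(step)   # 2D matrix weights

for s in (0, 1_095, 20_000, 34_000, 36_530):
    print(s, f"{adamw_lr(s):.2e}", f"{aurora_lr(s):.2e}")
```

The warmup and decay lengths correspond to about 3% and 10% of the 36,530 total steps.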

Optimizer routing: Aurora handles 2D hidden matrices; AdamW handles embeddings, norm weights, biases, and Conv1d depthwise weights (3D tensors, present only in the ConvSwiGLU variant).
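A routing rule of this kind can be sketched as a function of each parameter's name and shape. The parameter names below are hypothetical, and identifying embeddings by an "embed" substring is an assumption; the actual codebase may select them differently. The conv kernel width shown is a placeholder.

```python
def route_parameter(name, shape):
    """Pick an optimizer for one parameter from its name and shape.

    ASSUMPTION: embeddings are recognised by an "embed" substring in the name.
    """
    if len(shape) == 2 and "embed" not in name:
        return "aurora"   # 2D hidden matrices
    return "adamw"        # embeddings, norm weights, biases, 3D conv weights

# Hypothetical parameter names; shapes follow the architecture settings.
params = {
    "tok_embed.weight": (16_000, 896),
    "layers.0.attn.q_proj.weight": (896, 896),
    "layers.0.ffn.gate_up.weight": (8_704, 896),
    "layers.0.attn_norm.weight": (896,),
    "layers.0.ffn.conv.weight": (4_352, 1, 2),  # ConvSwiGLU only; kernel width is a placeholder
}
groups = {name: route_parameter(name, shape) for name, shape in params.items()}
for name, optimizer in groups.items():
    print(f"{optimizer:6s} {name}")
```

Embeddings are 2D as well, so a shape check alone is not enough, which is why the sketch also inspects the name.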

Token Superposition Training (TST)

| Phase | Steps | Seq len (model) | Objective |
|---|---|---|---|
| Phase 1 (superposition) | 10,959 | 2,048 compressed (from 12,288 raw) | Multi-hot CE over 6-token bags |
| Phase 2 (recovery) | 25,571 | 2,048 | Standard next-token CE |

TST uses bag size s=6 and a step ratio of 0.30 (10,959 of 36,530 steps). Optimizer state and the LR schedule carry over from phase 1 to phase 2 (unified schedule). torch.compile was disabled for checkpoint compatibility.
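The two phases can be tied back to the token budget with some arithmetic. This sketch assumes phase 1 uses the same batch size and gradient accumulation as recovery (the card lists tokens per step only for recovery), so each phase 1 step covers 6 raw tokens for every model position.

```python
BAG_SIZE = 6
SEQ_LEN = 2_048
BATCH, GRAD_ACCUM = 2, 16            # ASSUMPTION: identical in both phases
TOTAL_STEPS = 36_530
PHASE1_STEPS, PHASE2_STEPS = 10_959, 25_571
BUDGET = 5_985_154_560

positions_per_step = BATCH * GRAD_ACCUM * SEQ_LEN        # model positions per optimizer step
raw_per_step_phase1 = positions_per_step * BAG_SIZE      # 6 raw tokens behind each position

phase1_tokens = PHASE1_STEPS * raw_per_step_phase1
phase2_tokens = PHASE2_STEPS * positions_per_step
total_tokens = phase1_tokens + phase2_tokens

print(f"phase 1: {phase1_tokens:,} raw tokens")
print(f"phase 2: {phase2_tokens:,} tokens")
print(f"total:   {total_tokens:,} ({BUDGET - total_tokens:,} under budget)")
```

Under these assumptions the run consumes 5,985,075,200 raw tokens, 79,360 short of the stated budget, with roughly 72% of them seen during the superposition phase.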

Multi-Token Prediction

MTP uses depth 1 with a loss weight of 0.1. During TST phase 1 the MTP head predicts bag-shifted targets; during recovery it performs standard +1 token prediction. MTP weights are saved in the checkpoint but ignored at inference (main LM head only).
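The role of the 0.1 weight can be shown in a few lines. The sketch assumes the auxiliary loss is simply added to the main loss after scaling, which is the usual reading of "loss weight" but is not spelled out in the card. The function names and loss values are made up for illustration.

```python
MTP_WEIGHT = 0.1

def training_loss(main_ce, mtp_ce, mtp_weight=MTP_WEIGHT):
    """Combine the main next-token loss with the auxiliary MTP loss.

    ASSUMPTION: a plain weighted sum.
    """
    return main_ce + mtp_weight * mtp_ce

def reported_val_loss(main_ce, mtp_ce):
    """Validation uses the main LM head only, so the MTP term is dropped."""
    del mtp_ce  # deliberately unused
    return main_ce

train = training_loss(main_ce=2.70, mtp_ce=3.40)      # toy values, not measured
val = reported_val_loss(main_ce=2.70, mtp_ce=3.40)
print(f"train objective = {train:.2f}, reported val = {val:.2f}")
```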

Dataset

Trained on open-index/open-wikipedia-markdown.

| Split | Shards |
|---|---|
| Train | en-00001 to en-00004 (~6B tokens, chinchilla_train.bin) |
| Validation | en-00014 (data/tokens/val.bin) |

Tokenized with the project 16,000-token BPE vocabulary (data/tokenizer.json). Legacy shard en-00000 was excluded from training to avoid overlap with prior experiments.

Usage

Each repo includes modeling_chinchilla_300m.py, a self-contained inference script that supports both the SwiGLU and ConvSwiGLU variants.

Command line:

    pip install torch safetensors tokenizers
    python modeling_chinchilla_300m.py --repo_dir . --prompt "The theory of" --max_new_tokens 200

Python:

    from modeling_chinchilla_300m import load_model, generate

    # Load the weights and tokenizer from the current directory (the cloned repo).
    model, tokenizer = load_model(".", device="cuda")
    print(generate(model, tokenizer, "In quantum mechanics,", max_new_tokens=200, temperature=1.0, top_k=50))

Limitations

  • Research-scale 300M model; not a production system.
  • Trained on English Wikipedia Markdown only.
  • Custom architecture — use the bundled modeling_chinchilla_300m.py for inference.
  • MTP auxiliary heads are present in weights but unused at inference.

References

  • Gouge, H. Training codebase and infrastructure.
  • Peng, B., Gigant, E., Quesnelle, J. Token Superposition Training. arXiv:2605.06546
  • DeepSeek-AI. DeepSeek-V3 Technical Report. arXiv:2412.19437
  • Tilde Research. Aurora optimizer. blog
  • Universal Reasoning Model — ConvSwiGLU module. arXiv:2512.14693