AOMTS-TST-s6-100M-3k-1MTP-v1

Validation loss: 2.204673 (next-token cross-entropy, nats)

Part of the Aurora Optimized Multi-Token Superposition (AOMTS) experiment series.

This series evaluates whether Token Superposition Training (TST) and Multi-Token Prediction (MTP) improve language model quality, and whether combining them yields further gains.

Key findings: TST alone improved validation loss by ~0.073 nats over the base (no-TST, no-MTP) model, AOMTS-Base-100M-3k-0MTP-v1-run2. MTP=1 alone improved by ~0.011 nats. Combining TST with MTP=1 achieved the best result in the series, 2.204673 nats, a total improvement of ~0.083 nats over the base. TST+MTP=2 did not improve over TST+MTP=1, suggesting diminishing returns beyond one MTP head at this scale.

Validation loss is next-token cross-entropy in nats, evaluated on a held-out Wikipedia Markdown validation set using the same 16,000-token BPE vocabulary. Lower is better.

Research artifact. These checkpoints are screening-scale models (3,000 steps, ~100M parameters) released for research and ablation comparison. They are not intended as production models.

Why 3,000 steps? Across dozens of prior experiments that ran 15,000+ steps, the eventual winner was consistently already ahead of competing runs within the first 2,000 steps. Running to 3,000 steps gives a clear signal while keeping turnaround fast enough to run many conditions in parallel.

Why ~100M parameters? Across many experiments at 200M–500M parameters, the configuration that won at the larger scale also consistently won at ~100M. Screening at ~100M is therefore a reliable and efficient proxy: top candidates from this series will be scaled further.

What might change at scale? At ~100M parameters and 3,000 steps, the model has limited capacity to predict far into the future — which likely explains why MTP=1 was optimal and MTP=2 did not help. A small model trained on relatively little data cannot reliably leverage the signal from heads that predict multiple steps ahead; the additional auxiliary loss may add noise rather than useful gradient. At larger model sizes and longer training runs, the optimal MTP depth is expected to increase as the model gains the capacity to make accurate multi-step predictions. Similarly, the optimal TST bag size (s=6 here) may shift with scale — larger models may benefit from larger or smaller bags depending on how effectively they can decompress the superposition signal during recovery. Further research is needed to determine how these findings scale across model size, training budget, and TST bag size.

About This Model

The best model in the AOMTS series. Combines Token Superposition Training (TST, bag size s=6) with one Multi-Token Prediction (MTP) head, achieving the lowest validation loss at 2.204673 nats — a total improvement of ~0.083 nats over the no-TST, no-MTP baseline (AOMTS-Base-100M-3k-0MTP-v1-run2). TST and MTP=1 are complementary; adding a second MTP head (AOMTS-TST-s6-100M-3k-2MTP-v1) does not further improve results at this scale.

Architecture

Parameter	Value
Vocabulary size	16,000
Hidden dimension (d_model)	512
Layers	12
Attention heads	8
KV heads	8
Head dimension	64
FFN hidden dimension	4,800
FFN variant	SwiGLU
Max sequence length	2,048
RoPE θ	10,000
Normalization	RMSNorm
Tied embeddings	Yes

Total parameters: 126,401,024
- Embeddings (tied tok_emb / lm_head): 8,192,000
- Non-embedding, non-MTP (transformer blocks): 109,261,312 (identical across all AOMTS runs)
- MTP heads (1 × 8,947,712): 8,947,712

MTP parameters are auxiliary training heads. They are not used during standard language modeling evaluation and do not affect validation loss: the val loss reported here is computed from the main head only.
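
As a quick sanity check, the component counts listed above can be summed to reproduce the stated total. This is only arithmetic on the figures in this card; the variable names are illustrative and do not come from the training code.

```python
# Component counts copied from the parameter breakdown in this card.
embeddings = 8_192_000            # tied tok_emb / lm_head
transformer_blocks = 109_261_312  # non-embedding, non-MTP
mtp_heads = 1 * 8_947_712         # one auxiliary MTP head

total = embeddings + transformer_blocks + mtp_heads

# The MTP head is auxiliary, so the main-head path excludes it.
main_path = embeddings + transformer_blocks

# The tied embedding size follows from vocabulary size x d_model.
assert embeddings == 16_000 * 512

print(f"total parameters:     {total:,}")
print(f"excluding MTP head:   {main_path:,}")
```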

Training

Setting	Value
Total steps	3,000
Batch size	16 sequences
Gradient accumulation	2
Effective batch size	32 sequences / 65,536 model-context tokens per step
Total raw tokens seen	491,520,000 (in phase 1, bag_size=6 expansion means each 2,048-position sequence covers 12,288 raw tokens, i.e. 393,216 raw tokens per step; the run uses the same 3,000 training steps as all other AOMTS runs)
Sequence length	2,048
LR schedule	WSD — 150 warmup steps, stable LR, then linear decay over the last 300 steps (final 10 % of training) to 0.0
Warmup steps	150
Min LR	0.0
Weight decay	0.1

Optimizer: Aurora (matrix weights) + AdamW (embeddings & norms)

Aurora matrix weight lr: 0.02
AdamW embedding/norm lr: 0.0003
Weight decay: 0.1
Gradient clip: 1.0
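
The WSD schedule described in the table can be sketched as a simple function of the step index. This is a minimal illustration, not the training code: the function name `wsd_lr` is hypothetical, and the linear shape of the warmup is an assumption (the card only gives the warmup length).

```python
def wsd_lr(step, peak_lr, total_steps=3000, warmup_steps=150,
           decay_steps=300, min_lr=0.0):
    """Learning rate at a 0-indexed step under warmup-stable-decay."""
    if step < warmup_steps:
        # Assumption: linear warmup from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Linear decay over the last decay_steps steps down to min_lr.
    remaining = max(total_steps - step, 0) / decay_steps
    return min_lr + (peak_lr - min_lr) * remaining


# Aurora matrix-weight peak LR from this card.
for step in (0, 75, 150, 2700, 2850, 3000):
    print(step, wsd_lr(step, peak_lr=0.02))
```

The same function applies to the AdamW group by passing `peak_lr=0.0003`.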

Multi-Token Prediction (MTP)

MTP depth: 1 additional prediction head
MTP loss weight: 0.1
Parameters per MTP head: 8,947,712
Total MTP parameters: 8,947,712

During standard (phase 2 / non-TST) training, MTP head d (here d = 1) predicts the token at position i+1+d given the hidden state at position i, while the main head predicts the token at position i+1. The MTP auxiliary loss is added to the main CE loss with weight 0.1.
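
To make the target alignment concrete, here is a small sketch of which token each head is trained on and how the two losses are combined. The helper names `mtp_targets` and `combined_loss` are hypothetical and not taken from the training code; only the offsets and the 0.1 weight come from this card.

```python
MTP_LOSS_WEIGHT = 0.1  # from this card


def mtp_targets(tokens, d=1):
    """Return (position, main_target, mtp_target) triples.

    The main head at position i targets tokens[i + 1];
    MTP head d targets tokens[i + 1 + d]. Only positions
    where both targets exist are returned.
    """
    return [(i, tokens[i + 1], tokens[i + 1 + d])
            for i in range(len(tokens) - 1 - d)]


def combined_loss(main_ce, mtp_ce, weight=MTP_LOSS_WEIGHT):
    """Main cross-entropy plus the weighted auxiliary MTP loss."""
    return main_ce + weight * mtp_ce


print(mtp_targets([10, 11, 12, 13]))
print(combined_loss(2.0, 3.0))
```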

MTP during TST phase 1 (bag-shifted predictions):

In the superposition phase the model operates on a compressed sequence where each position represents a bag of 6 raw tokens. The main head at compressed position i predicts the next bag — raw tokens [(i+1)·6, …, (i+2)·6−1] — using Multi-hot Cross-Entropy (MCE) loss (average CE over all 6 targets).

MTP head d predicts a bag shifted d tokens forward relative to the main target: raw tokens [(i+1)·6+d, …, (i+2)·6+d−1]. Concretely:

MTP head 1: raw tokens [(i+1)·6+1, …, (i+2)·6] (shifted 1 token past the main target)

This ensures all MTP heads receive meaningful gradient signal during phase 1. In phase 2 (recovery), MTP heads revert to the standard token-level targets (position i+1+d for head d).
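
The index ranges above can be written as one small helper, where d = 0 gives the main head's target bag and d ≥ 1 gives the shifted bag for MTP head d. The function name `bag_target_indices` is illustrative only.

```python
def bag_target_indices(i, bag_size=6, d=0):
    """Raw-token indices targeted at compressed position i.

    d = 0 is the main head (the next bag); d >= 1 is MTP head d,
    whose bag is shifted d raw tokens forward.
    """
    start = (i + 1) * bag_size + d
    return list(range(start, start + bag_size))


print(bag_target_indices(0))        # main head, compressed position 0
print(bag_target_indices(0, d=1))   # MTP head 1, compressed position 0
```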

Token Superposition Training (TST)

This model was trained with Token Superposition Training (TST), following arXiv:2605.06546 (Peng, Gigant, Quesnelle — Nous Research).

Phase 1 — Superposition (900 steps, 30% of budget)

- Token embeddings are grouped into non-overlapping bags of 6 tokens
- Each bag is averaged into a single embedding vector
- The model operates on a compressed sequence of length 2,048 (from 12,288 raw tokens)
- Training objective: Multi-hot Cross-Entropy (MCE), i.e. predict the next bag of 6 tokens
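
The two phase 1 operations, bag averaging and the MCE loss, can be sketched in plain Python. This is a toy illustration using lists rather than tensors, assuming MCE is the mean cross-entropy over the tokens of the target bag as described in this card; `average_bags` and `mce_loss` are hypothetical names, not functions from the released code.

```python
import math


def average_bags(embeddings, bag_size=6):
    """Average consecutive, non-overlapping groups of bag_size vectors."""
    if len(embeddings) % bag_size != 0:
        raise ValueError("sequence length must be a multiple of bag_size")
    compressed = []
    for start in range(0, len(embeddings), bag_size):
        bag = embeddings[start:start + bag_size]
        # Element-wise mean across the vectors in this bag.
        compressed.append([sum(col) / bag_size for col in zip(*bag)])
    return compressed


def mce_loss(logits, target_ids):
    """Mean cross-entropy (nats) of one logit vector over all bag targets."""
    m = max(logits)
    # Numerically stable log-sum-exp for the softmax normaliser.
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return sum(log_z - logits[t] for t in target_ids) / len(target_ids)


toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(average_bags(toy, bag_size=2))
print(mce_loss([0.0, 0.0, 0.0, 0.0], [0, 1, 2]))
```

With uniform logits the loss equals the log of the vocabulary size, which is a convenient check that the normalisation is right.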

Phase 2 — Recovery (2100 steps, 70% of budget)

- Standard next-token cross-entropy prediction on the original sequence length (2,048)
- Model weights, optimizer state, and LR schedule continue from phase 1 (unified schedule)

TST Setting	Value
Bag size (s)	6
Phase 1 steps	900
Phase 2 steps	2100
Optimizer state carried over	Yes
LR schedule carried over	Yes

Dataset

Trained on open-index/open-wikipedia-markdown (Wikipedia Markdown). Tokenized with a custom 16,000-token BPE vocabulary.

Total raw tokens seen: 491,520,000
Model-context tokens per step: 65,536 (16 seqs × 2 grad accum × 2048 seq len)

Note on token count: TST phase 1 processes 6× as many raw tokens per step (393,216 instead of 65,536) because each sequence position is formed by averaging a bag of 6 token embeddings. This model was not trained for more steps than the base models: all AOMTS runs use the same 3,000-step budget. The higher raw token count reflects the bag expansion in phase 1 only.
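
The raw token total follows directly from the step counts, batch shape, and bag size given in this card. The short calculation below reproduces it; the constant names are illustrative.

```python
SEQS_PER_STEP = 16 * 2      # batch size x gradient accumulation
SEQ_LEN = 2048              # model-context positions per sequence
BAG_SIZE = 6                # raw tokens per position in phase 1
PHASE1_STEPS = 900
PHASE2_STEPS = 2100

context_tokens_per_step = SEQS_PER_STEP * SEQ_LEN

# Phase 1: every context position stands for BAG_SIZE raw tokens.
phase1_raw = PHASE1_STEPS * context_tokens_per_step * BAG_SIZE
# Phase 2: one raw token per context position.
phase2_raw = PHASE2_STEPS * context_tokens_per_step

total_raw = phase1_raw + phase2_raw
print(f"context tokens per step: {context_tokens_per_step:,}")
print(f"phase 1 raw tokens:      {phase1_raw:,}")
print(f"phase 2 raw tokens:      {phase2_raw:,}")
print(f"total raw tokens:        {total_raw:,}")
```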

Usage

Each repo includes modeling_aomts.py — a self-contained inference script with no external model code required.

pip install torch safetensors tokenizers

Command-line generation:

python modeling_aomts.py --repo_dir /path/to/repo --prompt "The theory of" --max_new_tokens 200

Python API:

from modeling_aomts import load_model, generate

model, tokenizer = load_model(".") # add device="cuda" for GPU
print(generate(model, tokenizer, "The theory of relativity states",
               max_new_tokens=200, temperature=1.0, top_k=50))

Generation options: temperature (lower = less random; 0 = greedy), top_k, top_p (nucleus sampling), max_new_tokens, device, dtype.

Full Experiment Comparison

All AOMTS models at a glance (equal 3,000-step budget, sorted by validation loss):

Model	MTP Depth	TST	LR Schedule	Optim Reset	Val Loss
AOMTS-TST-s6-100M-3k-1MTP-v1 ← this model	1	Yes (s=6¹, 900 steps)	WSD	—	2.204673
AOMTS-TST-s6-100M-3k-0MTP-v1	0	Yes (s=6¹, 900 steps)	WSD	—
AOMTS-TST-s6-100M-3k-2MTP-v1	2	Yes (s=6¹, 900 steps)	WSD	—	2.214605
AOMTS-Base-100M-3k-1MTP-v1	1	No	WSD	—	2.276289
AOMTS-Base-100M-3k-2MTP-v1	2	No	WSD	—	2.284260
AOMTS-Base-100M-3k-0MTP-v1-run2	0	No	WSD	—	2.287432
AOMTS-TST-s6-100M-3k-0MTP-RESET-v1	0	Yes (s=6¹, 900 steps)	WSD	Yes²	2.302689
AOMTS-Base-100M-3k-1MTP-Cosine-v1	1	No	Cosine	—	2.354897
AOMTS-Base-100M-3k-0MTP-v1	0	No	WSD	—	2.375539
¹ s = bag size: the number of raw tokens averaged into each compressed embedding position during TST phase 1.

² Optim Reset = phase 2 restarted the optimizer state and LR schedule from scratch rather than carrying them over from phase 1. Models without this flag use a unified schedule across both phases.
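
The improvements quoted in this card can be recomputed from the table. The sketch below uses only rows whose validation loss is listed above and measures each against the no-TST, no-MTP baseline AOMTS-Base-100M-3k-0MTP-v1-run2; the dictionary layout is illustrative.

```python
# Validation losses (nats) copied from the comparison table.
val_loss = {
    "AOMTS-TST-s6-100M-3k-1MTP-v1": 2.204673,
    "AOMTS-TST-s6-100M-3k-2MTP-v1": 2.214605,
    "AOMTS-Base-100M-3k-1MTP-v1": 2.276289,
    "AOMTS-Base-100M-3k-2MTP-v1": 2.284260,
    "AOMTS-Base-100M-3k-0MTP-v1-run2": 2.287432,
}

baseline_name = "AOMTS-Base-100M-3k-0MTP-v1-run2"
baseline = val_loss[baseline_name]

# Positive values mean the model beats the baseline.
improvement = {
    name: baseline - loss
    for name, loss in val_loss.items()
    if name != baseline_name
}

for name, delta in sorted(improvement.items(), key=lambda kv: -kv[1]):
    print(f"{name:32s} {delta:+.6f} nats")
```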

References

Peng, B., Gigant, E., Quesnelle, J. (Nous Research, 2025). Token Superposition Training for Language Model Pretraining. arXiv:2605.06546
DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437

Checkpoint format: Safetensors. Model size: 0.1B params. Tensor type: BF16.

URL: https://huggingface.co/hudsongouge/AOMTS-TST-s6-100M-3k-1MTP-v1
