AOMTS-TST-s6-100M-3k-0MTP-RESET-v1
Validation loss: 2.302689 (next-token cross-entropy, nats)
Part of the Aurora Optimized Multi-Token Superposition (AOMTS) experiment series.
This series evaluates whether Token Superposition Training (TST) and Multi-Token Prediction (MTP) improve language model quality, and whether combining them yields further gains.
Key findings: TST alone improved validation loss by ~0.073 nats over the base (no-TST, no-MTP) model; the deltas quoted here are measured against AOMTS-Base-100M-3k-0MTP-v1-run2 at 2.287432 nats. MTP=1 alone improved by ~0.011 nats. Combining TST with MTP=1 achieved the best result in the series at 2.204673 nats, a total improvement of ~0.083 nats over the base. TST+MTP=2 did not improve over TST+MTP=1, suggesting diminishing returns beyond one MTP head at this scale.
Validation loss is next-token cross-entropy in nats, evaluated on a held-out Wikipedia Markdown validation set using the same 16,000-token BPE vocabulary. Lower is better.
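For readers who want to relate the reported number to other units, the following is a minimal sketch of next-token cross-entropy in nats, written in plain Python. It is not the evaluation code used for this series; the function names are illustrative, and the toy logits stand in for real model outputs.

```python
import math

def next_token_cross_entropy(logits, targets):
    """Mean next-token cross-entropy in nats.

    logits:  list of per-position score lists, one score per vocabulary entry
    targets: list of the true next-token ids, one per position
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        # Negative log-probability of the correct token (natural log, so nats).
        total += log_z - scores[target]
    return total / len(targets)

def nats_to_perplexity(loss_nats):
    return math.exp(loss_nats)

def nats_to_bits(loss_nats):
    return loss_nats / math.log(2)

# Uniform scores over 4 tokens give ln(4) nats regardless of the target.
uniform_loss = next_token_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
print(uniform_loss, nats_to_perplexity(uniform_loss), nats_to_bits(uniform_loss))
```

Applying `nats_to_perplexity` to a validation loss from this card gives the corresponding per-token perplexity.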
Research artifact. These checkpoints are screening-scale models (3,000 steps, ~100M parameters) released for research and ablation comparison. They are not intended as production models.
Why 3,000 steps? After dozens of prior experiments running 15,000+ steps, it was consistently observed that the winning model was already ahead of competing runs within the first 2,000 steps. Running to 3,000 steps provides a clear signal while keeping turnaround fast enough to run many conditions in parallel.
Why ~100M parameters? After many experiments at 200M–500M parameters, the model that won at larger scale consistently also won at ~100M. Screening at ~100M is therefore a reliable and efficient proxy: top candidates from this series will be scaled further.
What might change at scale? At ~100M parameters and 3,000 steps, the model has limited capacity to predict far into the future — which likely explains why MTP=1 was optimal and MTP=2 did not help. A small model trained on relatively little data cannot reliably leverage the signal from heads that predict multiple steps ahead; the additional auxiliary loss may add noise rather than useful gradient. At larger model sizes and longer training runs, the optimal MTP depth is expected to increase as the model gains the capacity to make accurate multi-step predictions. Similarly, the optimal TST bag size (s=6 here) may shift with scale — larger models may benefit from larger or smaller bags depending on how effectively they can decompress the superposition signal during recovery. Further research is needed to determine how these findings scale across model size, training budget, and TST bag size.
About This Model
This is an early TST run (bag size s=6) in which the optimizer state and learning rate schedule were reset at the start of phase 2 (recovery). Specifically, the Adam moment buffers (m, v) were zeroed and the LR schedule restarted from step 0, including a new warmup, as if phase 2 were an entirely separate training run. All other TST models in this series carry optimizer state and LR schedule continuously across the phase 1 → phase 2 transition (unified schedule). The reset cost ~0.089 nats: 2.302689 vs. 2.213959 for the otherwise-comparable AOMTS-TST-s6-100M-3k-0MTP-v1, which also leaves this run behind the no-TST, no-MTP base (2.287432). In this comparison, carrying optimizer state and the LR schedule across the phase transition was clearly beneficial; because both were reset together, the run does not separate the effect of optimizer momentum from that of the schedule.
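To make the difference between carrying and resetting optimizer state concrete, here is a small sketch using a plain Adam update on a list of floats. It is an illustration only: `AdamState`, `adam_update`, and `reset_for_phase2` are hypothetical names, the hyperparameters are generic defaults, and resetting the bias-correction step count alongside m and v is an assumption rather than something the card states. The Aurora optimizer used for matrix weights is not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class AdamState:
    """Per-parameter Adam state: first moment m, second moment v, step count."""
    m: list = field(default_factory=list)
    v: list = field(default_factory=list)
    step: int = 0

def adam_update(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One plain Adam step on a flat list of floats (no weight decay)."""
    if not state.m:
        state.m = [0.0] * len(param)
        state.v = [0.0] * len(param)
    state.step += 1
    new_param = []
    for i, (p, g) in enumerate(zip(param, grad)):
        state.m[i] = beta1 * state.m[i] + (1 - beta1) * g
        state.v[i] = beta2 * state.v[i] + (1 - beta2) * g * g
        m_hat = state.m[i] / (1 - beta1 ** state.step)  # bias correction
        v_hat = state.v[i] / (1 - beta2 ** state.step)
        new_param.append(p - lr * m_hat / (v_hat ** 0.5 + eps))
    return new_param

def reset_for_phase2(state):
    """Zero m and v and restart the step count (step reset is an assumption)."""
    return AdamState(m=[0.0] * len(state.m), v=[0.0] * len(state.v), step=0)

# Phase 1: a few steps build up gradient history.
weights, state = [1.0, -2.0], AdamState()
for _ in range(5):
    weights = adam_update(weights, [0.5, -0.5], state)

carried = state                       # unified schedule: keep m, v, step
restarted = reset_for_phase2(state)   # reset: gradient history discarded
print(carried.m, restarted.m)
```

In both cases the model weights themselves are untouched; only the accumulated gradient statistics differ at the start of phase 2.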
Architecture
| Parameter | Value |
|---|---|
| Vocabulary size | 16,000 |
| Hidden dimension (d_model) | 512 |
| Layers | 12 |
| Attention heads | 8 |
| KV heads | 8 |
| Head dimension | 64 |
| FFN hidden dimension | 4,800 |
| FFN variant | SwiGLU |
| Max sequence length | 2,048 |
| RoPE θ | 10,000 |
| Normalization | RMSNorm |
| Tied embeddings | Yes |
- Total parameters: 117,453,312
- Embeddings (tied tok_emb / lm_head): 8,192,000
- Non-embedding, non-MTP (transformer blocks): 109,261,312 (identical across all AOMTS runs)
MTP parameters are auxiliary training heads. They are not used during standard language modeling evaluation and do not affect validation loss — the val loss reported here is computed from the main head only. The non-embedding, non-MTP parameter count (109,261,312) is identical across all runs in this series.
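The parameter figures can be related to the architecture table with a short calculation. The sketch below assumes bias-free linear projections, two RMSNorm gain vectors per layer, and one final RMSNorm; none of these details are stated in the card. Under those assumptions the embedding count matches the reported 8,192,000, while the transformer blocks come to 101,069,312, which is exactly 8,192,000 (one vocabulary × d_model matrix) below the reported 109,261,312. The reported figure may therefore follow a different accounting than the one assumed here.

```python
VOCAB, D_MODEL, LAYERS = 16_000, 512, 12
N_HEADS, N_KV_HEADS, HEAD_DIM, FFN_HIDDEN = 8, 8, 64, 4_800

# Tied token embedding / LM head: one vocab x d_model matrix.
embedding = VOCAB * D_MODEL

# Attention: Q and O projections use all heads, K and V use the KV heads.
attention = 2 * D_MODEL * N_HEADS * HEAD_DIM + 2 * D_MODEL * N_KV_HEADS * HEAD_DIM
# SwiGLU FFN: gate, up, and down projections.
ffn = 3 * D_MODEL * FFN_HIDDEN
# Two RMSNorm gain vectors per layer plus one final norm (both assumed).
per_layer = attention + ffn + 2 * D_MODEL
blocks = LAYERS * per_layer + D_MODEL

print(f"embedding: {embedding:,}")
print(f"blocks (under these assumptions): {blocks:,}")
```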
Training
| Setting | Value |
|---|---|
| Total steps | 3,000 |
| Batch size | 16 sequences |
| Gradient accumulation | 2 |
| Effective batch size | 32 sequences / 65,536 model-context tokens per step |
| Total raw tokens seen | 491,520,000 (phase 1 packs 12,288 raw tokens into each 2,048-position sequence via bag_size=6; the run uses the same 3,000 training steps as all other AOMTS runs) |
| Sequence length | 2,048 |
| LR schedule | WSD (warmup-stable-decay): 150 warmup steps, stable LR, then linear decay to 0.0 over the last 300 steps (final 10% of training) |
| Warmup steps | 150 |
| Min LR | 0.0 |
| Weight decay | 0.1 |
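The WSD schedule in the table can be sketched as a multiplier applied to the peak learning rate. This is an illustration under stated assumptions: the warmup is taken to be linear from 0, the decay is taken to reach exactly 0.0 at step 3,000, and `wsd_lr_multiplier` is a hypothetical name rather than a function from the training code.

```python
def wsd_lr_multiplier(step, total_steps=3_000, warmup_steps=150, decay_steps=300):
    """LR multiplier in [0, 1] for a warmup-stable-decay schedule.

    Assumes linear warmup from 0 and a linear decay that reaches 0.0 at total_steps.
    """
    if step < warmup_steps:
        return step / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return 1.0
    return max(0.0, (total_steps - step) / decay_steps)

for step in (0, 75, 150, 2_700, 2_850, 3_000):
    print(step, wsd_lr_multiplier(step))
```

Multiplying the result by an optimizer's peak learning rate gives the learning rate at that step.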
Optimizer: Aurora (matrix weights) + AdamW (embeddings & norms)
- Aurora matrix weight lr: 0.02
- AdamW embedding/norm lr: 0.0003
- Weight decay: 0.1
- Gradient clip: 1.0
Multi-Token Prediction (MTP)
Multi-Token Prediction is not used in this model.
Token Superposition Training (TST)
This model was trained with Token Superposition Training (TST), following arXiv:2605.06546 (Peng, Gigant, Quesnelle — Nous Research).
Phase 1 — Superposition (900 steps, 30% of budget)
- Token embeddings are grouped into non-overlapping bags of 6 tokens
- Each bag is averaged into a single embedding vector
- The model operates on a compressed sequence of length 2048 (from 12288 raw tokens)
- Training objective: Multi-hot Cross-Entropy (MCE) — predict the next bag of 6 tokens
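The phase 1 steps above can be sketched in plain Python. This is one plausible reading, not the implementation from the paper or from this series: the multi-hot target is assumed to put weight 1/s on each token of the next bag, and `average_bags` and `multi_hot_cross_entropy` are hypothetical names.

```python
import math

def average_bags(token_embeddings, bag_size=6):
    """Average non-overlapping bags of `bag_size` embedding vectors into one vector each."""
    if len(token_embeddings) % bag_size != 0:
        raise ValueError("sequence must split evenly into bags")
    bags = []
    for start in range(0, len(token_embeddings), bag_size):
        bag = token_embeddings[start:start + bag_size]
        dim = len(bag[0])
        bags.append([sum(vec[d] for vec in bag) / bag_size for d in range(dim)])
    return bags

def multi_hot_cross_entropy(logits, next_bag_token_ids):
    """Cross-entropy against a target spread evenly over the tokens of the next bag.

    Assumption: each token in the bag gets weight 1/len(bag), so a token that
    appears twice in the bag receives twice the weight.
    """
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return sum(log_z - logits[t] for t in next_bag_token_ids) / len(next_bag_token_ids)

# 12 toy embeddings of dimension 2 compress to 2 bag vectors.
embeddings = [[float(i), 1.0] for i in range(12)]
compressed = average_bags(embeddings)
loss = multi_hot_cross_entropy([0.0] * 8, [1, 3, 3, 5, 6, 7])
print(compressed, loss)
```

With bag size 6, a raw sequence of 12,288 tokens compresses to 12,288 / 6 = 2,048 positions, matching the model's context length.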
Phase 2 — Recovery (2100 steps, 70% of budget)
- Standard next-token cross-entropy prediction on the original sequence length (2048)
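- In the standard TST setup, model weights, optimizer state, and LR schedule all continue from phase 1 (unified schedule). In this run only the model weights carry over; the optimizer state and LR schedule are reset (see the note below).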
| TST Setting | Value |
|---|---|
| Bag size (s) | 6 |
| Phase 1 steps | 900 |
| Phase 2 steps | 2100 |
| Optimizer state carried over | No (reset) |
| LR schedule carried over | No (reset) |
Note on optimizer reset: Phase 2 of this run restarted the optimizer from scratch and started a new LR warmup at step 0. The other TST runs in this series carry optimizer state and LR schedule across phases for a unified schedule.
Dataset
Trained on open-index/open-wikipedia-markdown (Wikipedia Markdown). Tokenized with a custom 16,000-token BPE vocabulary.
- Total raw tokens seen: 491,520,000
- Model-context tokens per step: 65,536 (16 seqs × 2 grad accum × 2048 seq len)
Note on token count: TST phase 1 processes 6× more raw tokens per step because each sequence position is formed by averaging a bag of 6 token embeddings. This model was not trained for more steps than the base models — all AOMTS runs use the same 3,000-step budget. The higher raw token count reflects the bag expansion in phase 1 only.
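The raw token total follows directly from the settings listed in this card, as the short check below shows. The variable names are illustrative.

```python
SEQ_LEN = 2_048
SEQS_PER_STEP = 16 * 2          # 16 sequences x 2 gradient-accumulation steps
BAG_SIZE = 6
PHASE1_STEPS, PHASE2_STEPS = 900, 2_100

# Positions the model processes per optimizer step.
context_tokens_per_step = SEQS_PER_STEP * SEQ_LEN
# Phase 1: each position covers a bag of 6 raw tokens.
phase1_raw = PHASE1_STEPS * context_tokens_per_step * BAG_SIZE
# Phase 2: one raw token per position.
phase2_raw = PHASE2_STEPS * context_tokens_per_step
total_raw = phase1_raw + phase2_raw

print(f"{context_tokens_per_step:,} model-context tokens per step")
print(f"{phase1_raw:,} + {phase2_raw:,} = {total_raw:,} raw tokens")
```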
Usage
Each repo includes modeling_aomts.py — a self-contained inference script with no external model code required.
```shell
pip install torch safetensors tokenizers
```
Command-line generation:
```shell
python modeling_aomts.py --repo_dir /path/to/repo --prompt "The theory of" --max_new_tokens 200
```
Python API:
```python
from modeling_aomts import load_model, generate

model, tokenizer = load_model(".")  # add device="cuda" for GPU
print(generate(model, tokenizer, "The theory of relativity states",
               max_new_tokens=200, temperature=1.0, top_k=50))
```
Generation options: temperature (lower = less random; 0 = greedy), top_k, top_p (nucleus sampling), max_new_tokens, device, dtype.
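As a rough guide to how these options interact, the sketch below shows a generic temperature, top-k, and top-p sampler over a list of logits. It is not taken from modeling_aomts.py, whose internals are not shown here; `sample_next_token` is a hypothetical helper, and the order in which the filters are applied is an assumption.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick a token id from raw logits using temperature, top-k, and top-p filtering."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Rank token ids by score, then keep only the top_k best if requested.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]

    # Softmax over the surviving tokens at the given temperature.
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus filtering: keep the smallest prefix whose probability mass reaches top_p.
    if top_p is not None:
        cumulative, cutoff = 0.0, len(probs)
        for n, p in enumerate(probs, start=1):
            cumulative += p
            if cumulative >= top_p:
                cutoff = n
                break
        ranked, probs = ranked[:cutoff], probs[:cutoff]

    # random.choices accepts unnormalized weights.
    return rng.choices(ranked, weights=probs, k=1)[0]

print(sample_next_token([0.1, 2.0, -1.0], temperature=0.7, top_k=2))
```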
Full Experiment Comparison
All AOMTS models at a glance (equal 3,000-step budget, sorted by validation loss):
| Model | MTP Depth | TST | LR Schedule | Optim Reset | Val Loss |
|---|---|---|---|---|---|
| AOMTS-TST-s6-100M-3k-1MTP-v1 | 1 | Yes (s=6¹, 900 steps) | WSD | — | 2.204673 |
| AOMTS-TST-s6-100M-3k-0MTP-v1 | 0 | Yes (s=6¹, 900 steps) | WSD | — | 2.213959 |
| AOMTS-TST-s6-100M-3k-2MTP-v1 | 2 | Yes (s=6¹, 900 steps) | WSD | — | 2.214605 |
| AOMTS-Base-100M-3k-1MTP-v1 | 1 | No | WSD | — | 2.276289 |
| AOMTS-Base-100M-3k-2MTP-v1 | 2 | No | WSD | — | 2.284260 |
| AOMTS-Base-100M-3k-0MTP-v1-run2 | 0 | No | WSD | — | 2.287432 |
| AOMTS-TST-s6-100M-3k-0MTP-RESET-v1 ← this model | 0 | Yes (s=6¹, 900 steps) | WSD | Yes² | 2.302689 |
| AOMTS-Base-100M-3k-1MTP-Cosine-v1 | 1 | No | Cosine | — | 2.354897 |
| AOMTS-Base-100M-3k-0MTP-v1 | 0 | No | WSD | — | 2.375539 |
¹ s = bag size: the number of raw tokens averaged into each compressed embedding position during TST phase 1.
² Optim Reset = phase 2 restarted the optimizer state and LR schedule from scratch rather than carrying them over from phase 1. Models without this flag use a unified schedule across both phases.
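The improvements quoted in the key findings can be recomputed from the validation losses reported in this card. The sketch below uses AOMTS-Base-100M-3k-0MTP-v1-run2 as the reference; that choice is inferred from the fact that it reproduces the quoted deltas, not from an explicit statement in the card.

```python
val_loss = {
    "AOMTS-TST-s6-100M-3k-1MTP-v1": 2.204673,
    "AOMTS-TST-s6-100M-3k-0MTP-v1": 2.213959,
    "AOMTS-TST-s6-100M-3k-2MTP-v1": 2.214605,
    "AOMTS-Base-100M-3k-1MTP-v1": 2.276289,
    "AOMTS-Base-100M-3k-2MTP-v1": 2.284260,
    "AOMTS-Base-100M-3k-0MTP-v1-run2": 2.287432,
    "AOMTS-TST-s6-100M-3k-0MTP-RESET-v1": 2.302689,
    "AOMTS-Base-100M-3k-1MTP-Cosine-v1": 2.354897,
    "AOMTS-Base-100M-3k-0MTP-v1": 2.375539,
}

# Reference run for the deltas (inferred, see the note above).
baseline = val_loss["AOMTS-Base-100M-3k-0MTP-v1-run2"]

# Positive delta = lower loss than the reference, i.e. an improvement.
for name, loss in sorted(val_loss.items(), key=lambda item: item[1]):
    print(f"{name:38s} {loss:.6f} {baseline - loss:+.6f}")

# Cost of resetting optimizer state and LR schedule at the phase boundary.
reset_penalty = (val_loss["AOMTS-TST-s6-100M-3k-0MTP-RESET-v1"]
                 - val_loss["AOMTS-TST-s6-100M-3k-0MTP-v1"])
print(f"reset penalty: {reset_penalty:.6f} nats")
```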
Notes
Phase 2 restarted the optimizer from scratch and reset the LR schedule to a new warmup. The run tests the effect of restarting optimizer state and LR schedule against keeping them continuous throughout, as all other TST runs in this series do.
References
- Peng, B., Gigant, E., Quesnelle, J. (Nous Research, 2025). Token Superposition Training for Language Model Pretraining. arXiv:2605.06546
- DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437
