
URL: https://huggingface.co/hudsongouge/AOMTS-TST-s6-100M-3k-1MTP-v1


AOMTS-TST-s6-100M-3k-1MTP-v1

Validation loss: 2.204673 (next-token cross-entropy, nats)

Part of the Aurora Optimized Multi-Token Superposition (AOMTS) experiment series.

This series evaluates whether Token Superposition Training (TST) and Multi-Token Prediction (MTP) improve language model quality, and whether combining them yields further gains.

Key findings: TST alone improved validation loss by ~0.073 nats over the base (no-TST, no-MTP) model. MTP=1 alone improved by ~0.011 nats. Combining TST with MTP=1 achieved the best result in the series at 2.204673 nats — a total improvement of ~0.083 nats over the base. TST+MTP=2 did not improve over TST+MTP=1, suggesting diminishing returns beyond one MTP head at this scale.

Validation loss is next-token cross-entropy in nats, evaluated on a held-out Wikipedia Markdown validation set using the same 16,000-token BPE vocabulary. Lower is better.

Research artifact. These checkpoints are screening-scale models (3,000 steps, ~100M parameters) released for research and ablation comparison. They are not intended as production models.

Why 3,000 steps? After dozens of prior experiments running 15,000+ steps, it was consistently observed that the winning model was already ahead of competing runs within the first 2,000 steps. Running to 3,000 steps provides a clear signal while keeping turnaround fast enough to run many conditions in parallel.

Why ~100M parameters? After many experiments at 200M–500M parameters, the model that won at larger scale consistently also won at ~100M. Screening at ~100M is therefore a reliable and efficient proxy: top candidates from this series will be scaled further.

What might change at scale? At ~100M parameters and 3,000 steps, the model has limited capacity to predict far into the future — which likely explains why MTP=1 was optimal and MTP=2 did not help. A small model trained on relatively little data cannot reliably leverage the signal from heads that predict multiple steps ahead; the additional auxiliary loss may add noise rather than useful gradient. At larger model sizes and longer training runs, the optimal MTP depth is expected to increase as the model gains the capacity to make accurate multi-step predictions. Similarly, the optimal TST bag size (s=6 here) may shift with scale — larger models may benefit from larger or smaller bags depending on how effectively they can decompress the superposition signal during recovery. Further research is needed to determine how these findings scale across model size, training budget, and TST bag size.

About This Model

The best model in the AOMTS series. Combines Token Superposition Training (TST, bag size s=6) with one Multi-Token Prediction (MTP) head, achieving the lowest validation loss at 2.204673 nats — a total improvement of ~0.083 nats over the no-TST, no-MTP baseline (AOMTS-Base-100M-3k-0MTP-v1-run2). TST and MTP=1 are complementary; adding a second MTP head (AOMTS-TST-s6-100M-3k-2MTP-v1) does not further improve results at this scale.

Architecture

Parameter | Value
--- | ---
Vocabulary size | 16,000
Hidden dimension (d_model) | 512
Layers | 12
Attention heads | 8
KV heads | 8
Head dimension | 64
FFN hidden dimension | 4,800
FFN variant | SwiGLU
Max sequence length | 2,048
RoPE θ | 10,000
Normalization | RMSNorm
Tied embeddings | Yes

  • Total parameters: 126,401,024
    • Embeddings (tied tok_emb / lm_head): 8,192,000
    • Non-embedding, non-MTP (transformer blocks): 109,261,312 (identical across all AOMTS runs)
    • MTP heads (1 × 8,947,712): 8,947,712

MTP parameters are auxiliary training heads. They are not used during standard language modeling evaluation and do not affect validation loss — the val loss reported here is computed from the main head only. The non-embedding, non-MTP parameter count (109,261,312) is identical across all runs in this series.
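
The counts above can be cross-checked against the architecture table. The sketch below is only arithmetic under an assumed layer layout: bias-free linear layers, two RMSNorm weight vectors per block, one final RMSNorm, and an MTP head made of one extra transformer block plus a 2·d_model → d_model projection and two RMSNorms. None of these layout details are stated in the card or taken from modeling_aomts.py; they are assumptions chosen because they reproduce the listed figures.

```python
# Parameter-count cross-check from the architecture table.
# ASSUMPTIONS (not stated in the card): bias-free linear layers, two RMSNorm
# weight vectors per block, one final RMSNorm, and an MTP head built from one
# extra block plus a (2 * d_model -> d_model) projection and two RMSNorms.
VOCAB, D_MODEL, N_LAYERS = 16_000, 512, 12
N_HEADS, N_KV_HEADS, HEAD_DIM, D_FFN = 8, 8, 64, 4_800


def block_params() -> int:
    """Weights in one transformer block under the assumptions above."""
    q_proj = D_MODEL * N_HEADS * HEAD_DIM
    kv_proj = 2 * D_MODEL * N_KV_HEADS * HEAD_DIM
    o_proj = N_HEADS * HEAD_DIM * D_MODEL
    swiglu = 3 * D_MODEL * D_FFN  # gate, up, and down projections
    norms = 2 * D_MODEL
    return q_proj + kv_proj + o_proj + swiglu + norms


embeddings = VOCAB * D_MODEL                  # tied tok_emb / lm_head
trunk = N_LAYERS * block_params() + D_MODEL   # 12 blocks + final RMSNorm
mtp_head = block_params() + 2 * D_MODEL * D_MODEL + 2 * D_MODEL

print(f"embeddings          {embeddings:>12,}")
print(f"blocks + final norm {trunk:>12,}")
print(f"trunk + embeddings  {trunk + embeddings:>12,}")
print(f"one MTP head        {mtp_head:>12,}")
```

Under these assumptions the embedding matrix (8,192,000) and the MTP head (8,947,712) match the card. The 12 blocks plus the final norm come to 101,069,312, and adding the tied embedding matrix gives exactly 109,261,312, so the figure labelled "non-embedding, non-MTP" may already include the embeddings. The breakdown should be confirmed against the checkpoint before it is relied on.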

Training

Setting | Value
--- | ---
Total steps | 3,000
Batch size | 16 sequences
Gradient accumulation | 2
Effective batch size | 32 sequences / 65,536 model-context tokens per step
Total raw tokens seen | 491,520,000 (phase 1 processes 12,288 raw tokens per sequence, i.e. 393,216 per step, via bag_size=6 expansion; same 3,000 training steps as all other AOMTS runs)
Sequence length | 2,048
LR schedule | WSD: 150 warmup steps, stable LR, then linear decay to 0.0 over the last 300 steps (final 10% of training)
Warmup steps | 150
Min LR | 0.0
Weight decay | 0.1

Optimizer: Aurora (matrix weights) + AdamW (embeddings & norms)

  • Aurora matrix weight lr: 0.02
  • AdamW embedding/norm lr: 0.0003
  • Weight decay: 0.1
  • Gradient clip: 1.0
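
The WSD (warmup, stable, decay) schedule described in the training settings can be sketched as a function of the step index. The step counts and the final LR of 0.0 come from the card; the linear shape of the warmup, the 0-indexed step convention, and the function name `wsd_lr` are assumptions made for illustration.

```python
def wsd_lr(step: int, peak_lr: float, total_steps: int = 3000,
           warmup_steps: int = 150, decay_steps: int = 300,
           min_lr: float = 0.0) -> float:
    """Learning rate at a 0-indexed step under a warmup-stable-decay schedule."""
    if step < warmup_steps:
        # Assumed linear warmup; the card does not state the warmup shape.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return peak_lr  # stable phase
    # Linear decay from peak_lr toward min_lr over the last decay_steps steps.
    frac = (step - decay_start) / decay_steps
    return peak_lr + (min_lr - peak_lr) * frac


# Aurora matrix-weight LR (0.02) at a few points in the 3,000-step run.
for s in (0, 149, 1500, 2700, 2850, 2999):
    print(s, wsd_lr(s, peak_lr=0.02))
```

The same function would apply to the AdamW group with `peak_lr=0.0003`, assuming both parameter groups share one schedule shape.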

Multi-Token Prediction (MTP)

  • MTP depth: 1 additional prediction head
  • MTP loss weight: 0.1
  • Parameters per MTP head: 8,947,712
  • Total MTP parameters: 8,947,712

During standard (phase 2 / non-TST) training, each MTP head (d = 1) predicts the token at position i+1+d given the hidden state at position i. The MTP auxiliary loss is added to the main CE loss with weight 0.1.
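
As a minimal illustration of the phase 2 targets and loss weighting just described, the sketch below builds the shifted target sequence for each head and combines two already-computed loss values. The helper names are hypothetical, and summing over heads is an assumption that makes no difference here because this model has a single MTP head.

```python
MTP_WEIGHT = 0.1  # MTP loss weight from the card


def shifted_targets(tokens, d):
    """Targets for head d, where d = 0 is the main head.

    The hidden state at position i is paired with tokens[i + 1 + d], so the
    target sequence is the input shifted left by 1 + d positions.
    """
    return tokens[1 + d:]


def total_loss(main_ce, mtp_ces, weight=MTP_WEIGHT):
    """Main cross-entropy plus the weighted auxiliary MTP losses (assumed sum)."""
    return main_ce + weight * sum(mtp_ces)


tokens = [10, 11, 12, 13, 14]
print(shifted_targets(tokens, 0))  # main head: next token
print(shifted_targets(tokens, 1))  # MTP head 1: token after next
print(total_loss(2.0, [3.0]))
```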

MTP during TST phase 1 (bag-shifted predictions):

In the superposition phase the model operates on a compressed sequence where each position represents a bag of 6 raw tokens. The main head at compressed position i predicts the next bag — raw tokens [(i+1)·6, …, (i+2)·6−1] — using Multi-hot Cross-Entropy (MCE) loss (average CE over all 6 targets).

MTP head d predicts a bag shifted d tokens forward relative to the main target: raw tokens [(i+1)·6+d, …, (i+2)·6+d−1]. Concretely:

  • MTP head 1: raw tokens [(i+1)·6+1, …, (i+2)·6] (shifted 1 token past the main target)

This ensures all MTP heads receive meaningful gradient signal during phase 1. In phase 2 (recovery), MTP heads revert to standard next-token prediction.
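
The phase 1 index ranges above can be written as one small helper, with d = 0 standing for the main head. The function name is hypothetical; the formula is the one given in the text.

```python
BAG_SIZE = 6  # s, the TST bag size used by this model


def bag_target_range(i: int, d: int = 0, s: int = BAG_SIZE) -> range:
    """Raw-token indices predicted at compressed position i.

    d = 0 is the main head (the next bag); d >= 1 is MTP head d, whose target
    bag is shifted d raw tokens forward.
    """
    start = (i + 1) * s + d
    return range(start, start + s)


print(list(bag_target_range(0)))       # main head at compressed position 0
print(list(bag_target_range(0, d=1)))  # MTP head 1 at compressed position 0
```

At compressed position 0 the main head targets raw tokens 6 to 11, and MTP head 1 targets raw tokens 7 to 12, so consecutive heads share five of their six targets.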

Token Superposition Training (TST)

This model was trained with Token Superposition Training (TST), following arXiv:2605.06546 (Peng, Gigant, Quesnelle — Nous Research).

Phase 1 — Superposition (900 steps, 30% of budget)

  • Token embeddings are grouped into non-overlapping bags of 6 tokens
  • Each bag is averaged into a single embedding vector
  • The model operates on a compressed sequence of length 2048 (from 12288 raw tokens)
  • Training objective: Multi-hot Cross-Entropy (MCE) — predict the next bag of 6 tokens
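
The two phase 1 operations listed above, bag averaging and the multi-hot cross-entropy, can be sketched in plain Python. This is an illustration of the description in this card (mean of the bag embeddings; average CE over the bag's targets), not the code from the paper or from the training script, and the function names are hypothetical.

```python
import math


def average_bags(embeddings, s=6):
    """Mean-pool consecutive, non-overlapping groups of s embedding vectors."""
    assert len(embeddings) % s == 0, "sequence length must be a multiple of s"
    bags = []
    for start in range(0, len(embeddings), s):
        group = embeddings[start:start + s]
        dim = len(group[0])
        bags.append([sum(vec[j] for vec in group) / s for j in range(dim)])
    return bags


def mce_loss(logits, target_ids):
    """Average cross-entropy (nats) of one logit vector over a bag of targets."""
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return sum(log_z - logits[t] for t in target_ids) / len(target_ids)


# 12 raw positions with 2-dimensional toy embeddings become 2 compressed positions.
raw = [[float(k), 0.0] for k in range(12)]
print(average_bags(raw))
# Uniform logits over a 4-token toy vocabulary give a loss of ln(4).
print(mce_loss([0.0, 0.0, 0.0, 0.0], [0, 1, 2, 3, 0, 1]))
```

With s = 6, a 12,288-token raw sequence is pooled to 2,048 compressed positions, which is how the model-context length stays fixed across both phases.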

Phase 2 — Recovery (2100 steps, 70% of budget)

  • Standard next-token cross-entropy prediction on the original sequence length (2048)
  • Model weights, optimizer state, and LR schedule continue from phase 1 (unified schedule)

TST Setting | Value
--- | ---
Bag size (s) | 6
Phase 1 steps | 900
Phase 2 steps | 2100
Optimizer state carried over | Yes
LR schedule carried over | Yes

Dataset

Trained on open-index/open-wikipedia-markdown (Wikipedia Markdown). Tokenized with a custom 16,000-token BPE vocabulary.

  • Total raw tokens seen: 491,520,000
  • Model-context tokens per step: 65,536 (16 seqs × 2 grad accum × 2048 seq len)

Note on token count: TST phase 1 processes 6× as many raw tokens per step as phase 2, because each sequence position is formed by averaging a bag of 6 token embeddings. This model was not trained for more steps than the base models: all AOMTS runs use the same 3,000-step budget. The higher raw token count reflects the bag expansion in phase 1 only.
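
The total raw token count follows from the step counts and batch settings listed in this card. A short arithmetic check (variable names are illustrative):

```python
SEQS_PER_STEP = 16 * 2          # batch size × gradient accumulation
SEQ_LEN = 2048                  # model-context positions per sequence
BAG_SIZE = 6
PHASE1_STEPS, PHASE2_STEPS = 900, 2100

context_tokens_per_step = SEQS_PER_STEP * SEQ_LEN               # both phases
phase1_raw_per_step = context_tokens_per_step * BAG_SIZE        # 6 raw tokens per position
phase1_raw = PHASE1_STEPS * phase1_raw_per_step
phase2_raw = PHASE2_STEPS * context_tokens_per_step
total_raw = phase1_raw + phase2_raw

print(f"{context_tokens_per_step:,} context tokens per step")
print(f"{phase1_raw:,} raw tokens in phase 1, {phase2_raw:,} in phase 2")
print(f"{total_raw:,} raw tokens in total")
```

A run without TST at the same settings would see 3,000 × 65,536 = 196,608,000 raw tokens, so the TST run consumes 2.5× as much raw text within the same number of optimizer steps.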

Usage

Each repo includes modeling_aomts.py — a self-contained inference script with no external model code required.

pip install torch safetensors tokenizers

Command-line generation:

python modeling_aomts.py --repo_dir /path/to/repo --prompt "The theory of" --max_new_tokens 200

Python API:

from modeling_aomts import load_model, generate

model, tokenizer = load_model(".")  # add device="cuda" for GPU
print(generate(model, tokenizer, "The theory of relativity states",
               max_new_tokens=200, temperature=1.0, top_k=50))

Generation options: temperature (lower = less random; 0 = greedy), top_k, top_p (nucleus sampling), max_new_tokens, device, dtype.

Full Experiment Comparison

All AOMTS models at a glance (equal 3,000-step budget, sorted by validation loss):

Model | MTP Depth | TST | LR Schedule | Optim Reset | Val Loss
--- | --- | --- | --- | --- | ---
AOMTS-TST-s6-100M-3k-1MTP-v1 (this model) | 1 | Yes (s=6¹, 900 steps) | WSD | | 2.204673
AOMTS-TST-s6-100M-3k-0MTP-v1 | 0 | Yes (s=6¹, 900 steps) | WSD | |
AOMTS-TST-s6-100M-3k-2MTP-v1 | 2 | Yes (s=6¹, 900 steps) | WSD | | 2.214605
AOMTS-Base-100M-3k-1MTP-v1 | 1 | No | WSD | | 2.276289
AOMTS-Base-100M-3k-2MTP-v1 | 2 | No | WSD | | 2.284260
AOMTS-Base-100M-3k-0MTP-v1-run2 | 0 | No | WSD | | 2.287432
AOMTS-TST-s6-100M-3k-0MTP-RESET-v1 | 0 | Yes (s=6¹, 900 steps) | WSD | Yes² | 2.302689
AOMTS-Base-100M-3k-1MTP-Cosine-v1 | 1 | No | Cosine | | 2.354897
AOMTS-Base-100M-3k-0MTP-v1 | 0 | No | WSD | | 2.375539

¹ s = bag size: the number of raw tokens averaged into each compressed embedding position during TST phase 1.

² Optim Reset = phase 2 restarted the optimizer state and LR schedule from scratch rather than carrying them over from phase 1. Models without this flag use a unified schedule across both phases.
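
The improvements quoted in the key findings can be recomputed from the table. The check below uses only rows whose validation loss is listed; the row for AOMTS-TST-s6-100M-3k-0MTP-v1 has no value in the table and is left out rather than guessed. The dictionary layout is illustrative.

```python
# Validation losses (nats) copied from the comparison table.
val_loss = {
    "AOMTS-TST-s6-100M-3k-1MTP-v1": 2.204673,
    "AOMTS-TST-s6-100M-3k-2MTP-v1": 2.214605,
    "AOMTS-Base-100M-3k-1MTP-v1": 2.276289,
    "AOMTS-Base-100M-3k-2MTP-v1": 2.284260,
    "AOMTS-Base-100M-3k-0MTP-v1-run2": 2.287432,  # baseline used in the card
}

baseline = val_loss["AOMTS-Base-100M-3k-0MTP-v1-run2"]
# Positive numbers mean lower (better) loss than the baseline.
gain = {name: round(baseline - loss, 3) for name, loss in val_loss.items()}

for name, g in gain.items():
    print(f"{name:<34} {g:+.3f} nats")
```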

References

  • Peng, B., Gigant, E., Quesnelle, J. (Nous Research, 2025). Token Superposition Training for Language Model Pretraining. arXiv:2605.06546
  • DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437

Checkpoint format: Safetensors · Model size: 0.1B params · Tensor type: BF16
