URL: https://huggingface.co/vigneshwar234/TemporalMesh-Transformer


TemporalMesh Transformer (TMT v3)

Author: Vigneshwar LK
Paper: DOI 10.5281/zenodo.20287197
Code: github.com/vignesh2027/TemporalMesh-Transformer
Live Demo: HuggingFace Space
Benchmarks: TMT-Benchmarks Dataset


What is TMT?

TMT is a PyTorch transformer architecture that targets five limitations of standard transformers in a single model:

| Problem | Standard Transformer | TMT Solution |
|---|---|---|
| Quadratic attention cost | $O(S^2)$ per layer | Mesh Attention: $O(S \cdot k)$ over a dynamic $k$NN graph |
| Static attention topology | Fixed, fully connected | Dynamic graph rebuilt per layer from cosine similarity |
| Uniform token compute | All tokens use all $N$ layers | Adaptive Depth Routing: exit gate per token, avg 5.8/12 layers |
| Flat positional encoding | Position only | Temporal Decay: learned multiplicative semantic attenuation |
| No cross-sequence memory | Stateless | EMA Memory Anchors: 16 persistent fast-weight vectors |
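
The Mesh Attention row can be illustrated with a minimal PyTorch sketch. It is not the repository's implementation: the function name `knn_mesh_attention` is hypothetical, it uses a single head with no learned projections (the token states serve as queries, keys, and values), and it assumes a causal mask because the model predicts next tokens. Only `k=8` and the cosine-similarity graph come from the description above.

```python
import torch
import torch.nn.functional as F


def knn_mesh_attention(x, k=8):
    """Single-head attention restricted to each token's k most similar tokens.

    x: (S, D) token states. Returns (output (S, D), neighbour indices (S, k)).
    Hypothetical helper for illustration; requires k <= S.
    """
    S, D = x.shape
    # Cosine similarity between all token pairs. Building the graph this way
    # is still O(S^2); only the attention step below is O(S * k).
    unit = F.normalize(x, dim=-1)
    sim = unit @ unit.T                                        # (S, S)
    # Assumed causal mask: a token may only link to itself and earlier tokens.
    causal = torch.tril(torch.ones(S, S, dtype=torch.bool, device=x.device))
    sim = sim.masked_fill(~causal, float("-inf"))
    nbr_sim, nbr_idx = sim.topk(k, dim=-1)                     # (S, k)
    # Early tokens have fewer than k valid neighbours; mask the padding slots.
    valid = torch.isfinite(nbr_sim)
    neighbours = x[nbr_idx]                                    # (S, k, D)
    scores = torch.einsum("sd,skd->sk", x, neighbours) / D ** 0.5
    scores = scores.masked_fill(~valid, float("-inf"))
    weights = scores.softmax(dim=-1)                           # (S, k)
    out = torch.einsum("sk,skd->sd", weights, neighbours)      # (S, D)
    return out, nbr_idx


torch.manual_seed(0)
x = torch.randn(256, 64)
out, edges = knn_mesh_attention(x, k=8)
print(out.shape, edges.shape)
```

With $S = 256$ and $k = 8$, the attention step scores $S \cdot k = 2048$ pairs instead of $S^2 = 65536$, a 32-fold reduction. The graph construction in this sketch remains quadratic.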

Results

| Model | WikiText-2 PPL ↓ | WikiText-103 PPL ↓ | LongBench ↑ | Compute |
|---|---|---|---|---|
| Vanilla Transformer | 42.1 | 51.3 | 41.2 | 100% |
| Longformer | 39.6 | 47.2 | 49.8 | 62% |
| Mamba | 31.8 | 38.4 | 51.3 | 55% |
| RWKV | 33.1 | 40.9 | 48.7 | 50% |
| Full TMT | 29.4 | 36.1 | 53.4 | 48% |

All models have ~120M parameters. TMT was trained for 10K steps on WikiText-2 with AdamW and a cosine learning-rate schedule, using seeds 42, 1337, and 2024.


Architecture at a Glance

Input → Token Embedding + RoPE
  → [× 12 layers]
      MeshBuilder (kNN graph, cosine sim, top-k=8)
      Mesh Attention O(S·k) + Temporal Decay Encoding
      EMA Memory Anchor Cross-Attention (16 anchors, β=0.99)
      Dual-Stream FFN (syntax stream ‖ semantic stream, sigmoid gate)
      Exit Gate σ(W_gate · x) > 0.85 → token frozen
  → LayerNorm → Tied Output Projection
  → Logits (B, S, V)
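
The exit-gate step in the diagram can be sketched as follows. This is an illustration under assumptions, not the repository's code: `run_with_exit_gate` is a hypothetical name, the gate is a single weight vector `w_gate`, and frozen tokens are handled by masking the update rather than by skipping their compute, so the sketch shows the routing logic but not the savings.

```python
import torch


def run_with_exit_gate(x, layers, w_gate, threshold=0.85):
    """Apply layers in order, freezing tokens whose gate confidence passes threshold.

    x: (B, S, D) hidden states; layers: callables mapping (B, S, D) -> (B, S, D);
    w_gate: (D,) gate weights. Returns final states and one (B, S) mask per layer.
    """
    frozen = torch.zeros(x.shape[:2], dtype=torch.bool, device=x.device)
    exit_masks = []
    for layer in layers:
        updated = layer(x)
        # Frozen tokens keep their previous state; active tokens take the update.
        x = torch.where(frozen.unsqueeze(-1), x, updated)
        confidence = torch.sigmoid(x @ w_gate)                 # (B, S)
        # Once a token is frozen it stays frozen, so masks are cumulative here.
        frozen = frozen | (confidence > threshold)
        exit_masks.append(frozen.clone())
    return x, exit_masks


torch.manual_seed(0)
D = 16
layers = [torch.nn.Linear(D, D) for _ in range(12)]
with torch.no_grad():
    states, masks = run_with_exit_gate(torch.randn(1, 8, D), layers, torch.randn(D))
print([int(m.sum()) for m in masks])  # number of frozen tokens after each layer
```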

Output fields (TMTOutput dataclass):

  • logits — (B, S, V) next-token predictions
  • exit_masks — list of (B, S) booleans, one per layer
  • confidences — gate confidence per token per layer
  • graph_edges — sparse kNN edge list from final layer
  • memory_state — (M, D) final EMA anchor states
  • decay_scalars — temporal decay weights applied
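
The `memory_state` field holds the EMA anchors. The card gives only the anchor count (16) and β = 0.99, so the sketch below assumes the standard exponential-moving-average rule and leaves open how the per-anchor target vectors are computed. `ema_update` and `targets` are hypothetical names.

```python
def ema_update(anchors, targets, beta=0.99):
    """Return beta * anchor + (1 - beta) * target for each of the M anchors.

    anchors, targets: lists of M vectors, each a list of D floats.
    """
    return [
        [beta * a + (1.0 - beta) * t for a, t in zip(anchor, target)]
        for anchor, target in zip(anchors, targets)
    ]


M, D = 16, 4
anchors = [[0.0] * D for _ in range(M)]   # assumed zero initialisation
targets = [[1.0] * D for _ in range(M)]   # constant target, for illustration only
for _ in range(100):
    anchors = ema_update(anchors, targets)
print(round(anchors[0][0], 3))  # 0.634, i.e. 1 - 0.99**100
```

With β = 0.99 an anchor moves about 63% of the way to a constant target in 100 updates, so its effective memory horizon is on the order of 1/(1 − β) = 100 updates.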

Quick Start

git clone https://github.com/vignesh2027/TemporalMesh-Transformer
cd TemporalMesh-Transformer
pip install -e ".[dev]"

from tmt.model.config import TMTConfig
from tmt.model.model import TMTModel
import torch

config = TMTConfig(
 vocab_size=50257,
 d_model=512,
 n_heads=8,
 n_layers=12,
 graph_k=8,
 exit_threshold=0.85,
 memory_anchors=16,
)
model = TMTModel(config) # ~120M params

tokens = torch.randint(0, 50257, (1, 256))
out = model(tokens)

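print(out.logits.shape) # (1, 256, 50257)
print(out.exit_masks[-1]) # (1, 256) bool: tokens frozen by the last layer

# Average exit layer per token. Assumes each mask is cumulative (True once a
# token is frozen); the mean of the raw masks is a fraction, not a layer index.
masks = torch.stack(out.exit_masks).float() # (n_layers, B, S)
layers_used = ((1 - masks).sum(dim=0) + 1).clamp(max=len(out.exit_masks))
print(f"Avg exit layer: {layers_used.mean().item():.2f}") # ~5.8 reported for the trained model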
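
The `~120M params` comment can be checked by counting parameters directly. The helper below is a generic sketch; `count_parameters` and the `TinyTiedLM` stand-in are hypothetical and are used only so the example runs without the `tmt` package. Because `nn.Module.parameters()` yields each shared tensor once, a tied output projection is not double-counted.

```python
import torch.nn as nn


def count_parameters(module):
    """Number of trainable parameters, counting tied (shared) tensors once."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


class TinyTiedLM(nn.Module):
    """Hypothetical stand-in: an embedding with a tied output projection."""

    def __init__(self, vocab_size=100, d_model=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.proj = nn.Linear(d_model, vocab_size, bias=False)
        self.proj.weight = self.embed.weight  # weight tying


print(count_parameters(TinyTiedLM()))  # 800: the shared 100 x 8 matrix, counted once
```

Calling `count_parameters(model)` on the `TMTModel` instance built above would report the total for the chosen config.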

Training

python scripts/train.py \
 --dataset wikitext-2 \
 --model_size base \
 --steps 10000 \
 --lr 3e-4 \
 --batch_size 16 \
 --seq_len 256 \
 --exit_threshold 0.85 \
 --graph_k 8
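
The card states that training uses a cosine learning-rate schedule. As a rough sketch of what the flags above imply, the function below assumes no warmup and a floor of zero, neither of which the card specifies. `cosine_lr` is a hypothetical name, not part of `scripts/train.py`.

```python
import math


def cosine_lr(step, total_steps=10_000, peak_lr=3e-4, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, f"{cosine_lr(step):.2e}")
```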

Ablation Summary

| Config | PPL ↓ | Compute | VRAM |
|---|---|---|---|
| Vanilla Transformer | 42.1 | 100% | 18.4 GB |
| + Mesh Attention only | 37.8 | 62% | 11.2 GB |
| + Temporal Decay only | 40.3 | 98% | 18.4 GB |
| + Adaptive Exit only | 39.6 | 51% | 18.4 GB |
| Mesh + Decay | 34.2 | 61% | 11.2 GB |
| Mesh + Exit | 35.1 | 50% | 11.2 GB |
| Full TMT | 29.4 | 48% | 11.2 GB |

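The full combination is superadditive. The three single-component variants improve on the vanilla baseline by 4.3, 1.8, and 2.5 PPL (8.6 PPL in total), whereas Full TMT improves by 12.7 PPL, an interaction effect of 4.1 PPL beyond the sum of the individual contributions.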
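
The interaction figure can be reproduced from the ablation table. The snippet below uses only the PPL values listed there; the dictionary keys are shorthand chosen for this example.

```python
# PPL values copied from the ablation table.
ppl = {"vanilla": 42.1, "mesh": 37.8, "decay": 40.3, "exit": 39.6, "full": 29.4}

# Gain of each single-component variant over the vanilla baseline.
gains = {name: ppl["vanilla"] - ppl[name] for name in ("mesh", "decay", "exit")}
additive_gain = sum(gains.values())          # what purely additive effects would give
full_gain = ppl["vanilla"] - ppl["full"]     # what the full model achieves
interaction = full_gain - additive_gain      # the superadditive remainder

print(round(additive_gain, 1), round(full_gain, 1), round(interaction, 1))  # 8.6 12.7 4.1
```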


Citation

@misc{vigneshwar2026tmt,
 title = {TemporalMesh Transformer: Dynamic Graph Attention with
 Temporal Semantic Decay and Per-Token Adaptive Depth Routing},
 author = {Vigneshwar LK},
 year = {2026},
 doi = {10.5281/zenodo.20287197},
 url = {https://zenodo.org/records/20287390}
}

License

MIT License · © 2026 Vigneshwar LK
