TemporalMesh Transformer (TMT v3)

Author: Vigneshwar LK
Paper: DOI 10.5281/zenodo.20287197
Code: github.com/vignesh2027/TemporalMesh-Transformer
Live Demo: HuggingFace Space
Benchmarks: TMT-Benchmarks Dataset

What is TMT?

TMT is a PyTorch transformer architecture that addresses five limitations of standard transformers in a single model:

Problem	Standard Transformer	TMT Solution
Quadratic attention cost	$O(S^2)$ per layer	Mesh Attention: $O(S \cdot k)$ dynamic $k$NN graph
Static attention topology	Fixed fully-connected	Dynamic graph rebuilt per-layer from cosine similarity
Uniform token compute	All tokens use all $N$ layers	Adaptive Depth Routing: exit gate per token, avg 5.8/12 layers
Flat positional encoding	Position only	Temporal Decay: learned multiplicative semantic attenuation
No cross-sequence memory	Stateless	EMA Memory Anchors: 16 persistent fast-weight vectors
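
The first two rows can be illustrated with a minimal, framework-free sketch. It is not the repository's implementation: the function names (`build_knn_graph`, `mesh_attention`) are hypothetical, plain Python lists stand in for tensors, the queries, keys, and values are taken to be the raw token vectors, and causal masking, temporal decay, and multiple heads are omitted. Note that this naive graph builder still compares every pair of tokens; only the attention step is limited to k neighbors per token.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def build_knn_graph(x, k):
    """For each token, list the indices of its k most cosine-similar tokens (self included)."""
    neighbors = []
    for xi in x:
        sims = [(cosine(xi, xj), j) for j, xj in enumerate(x)]
        sims.sort(key=lambda t: (-t[0], t[1]))  # highest similarity first, index breaks ties
        neighbors.append([j for _, j in sims[:k]])
    return neighbors

def mesh_attention(q, keys, values, neighbors):
    """Scaled dot-product attention restricted to each token's neighbor list."""
    d = len(q[0])
    out = []
    for i, nbrs in enumerate(neighbors):
        scores = [sum(a * b for a, b in zip(q[i], keys[j])) / math.sqrt(d) for j in nbrs]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * values[j][c] for w, j in zip(weights, nbrs))
            for c in range(len(values[0]))
        ])
    return out

# Toy input: 4 tokens of width 2 forming two obvious clusters.
x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
neighbors = build_knn_graph(x, k=2)
out = mesh_attention(x, x, x, neighbors)
print(neighbors)  # [[0, 1], [1, 0], [2, 3], [3, 2]]
```

With S = 256 and k = 8, the values used elsewhere in this README, the attention step evaluates 256 × 8 = 2,048 scores per head instead of 256² = 65,536.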

Results

Model	WikiText-2 PPL ↓	WikiText-103 PPL ↓	LongBench ↑	Compute (% of vanilla)
Vanilla Transformer	42.1	51.3	41.2	100%
Longformer	39.6	47.2	49.8	62%
Mamba	31.8	38.4	51.3	55%
RWKV	33.1	40.9	48.7	50%
Full TMT	29.4	36.1	53.4	48%

All models: ~120M parameters. TMT trained for 10K steps on WikiText-2 (AdamW, cosine LR, seeds 42/1337/2024).

Architecture at a Glance

Input → Token Embedding + RoPE
  → [× 12 layers]
      MeshBuilder (kNN graph, cosine sim, top-k=8)
      Mesh Attention O(S·k) + Temporal Decay Encoding
      EMA Memory Anchor Cross-Attention (16 anchors, β=0.99)
      Dual-Stream FFN (syntax stream ‖ semantic stream, sigmoid gate)
      Exit Gate σ(W_gate · x) > 0.85 → token frozen
  → LayerNorm → Tied Output Projection
  → Logits (B, S, V)
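
The memory-anchor step in the diagram names an exponential moving average with β = 0.99. A minimal sketch of such an update follows, assuming the standard EMA rule m ← β·m + (1 − β)·u. How TMT computes the per-anchor summary u from the token states is not specified here, so `summaries` is a hypothetical input and `ema_update` is an illustrative name rather than the repository's API.

```python
def ema_update(anchors, summaries, beta=0.99):
    """Blend each anchor with its new summary vector: m <- beta * m + (1 - beta) * u."""
    return [
        [beta * m + (1.0 - beta) * u for m, u in zip(anchor, summary)]
        for anchor, summary in zip(anchors, summaries)
    ]

# Toy example: 2 anchors of width 3 (the default config uses 16 anchors).
anchors = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
summaries = [[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]]  # hypothetical per-anchor summaries

for _ in range(100):
    anchors = ema_update(anchors, summaries)

# After n identical updates, an anchor that started at zero has covered
# a fraction 1 - beta**n of the distance to its summary.
print([round(a, 3) for a in anchors[0]])
```

With β = 0.99 each update moves an anchor only 1% of the way toward its summary, so the anchors change slowly and effectively average over roughly 1/(1 − β) = 100 recent updates.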

Output fields (TMTOutput dataclass):

logits — (B, S, V) next-token predictions
exit_masks — list of (B, S) booleans, one per layer
confidences — gate confidence per token per layer
graph_edges — sparse kNN edge list from final layer
memory_state — (M, D) final EMA anchor states
decay_scalars — temporal decay weights applied
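
To make the relationship between the exit gate and the `exit_masks` and `confidences` fields concrete, here is a toy, framework-free sketch of per-token early exit. It assumes the gate is evaluated after each layer, that a frozen token keeps its state through later layers, and that each mask is cumulative (True once a token has exited); none of these details is confirmed by the field list above. `run_with_exit_gate`, `layers_used`, and `toy_layer` are hypothetical names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def run_with_exit_gate(tokens, layers, w_gate, threshold=0.85):
    """Apply layers in order; freeze a token once sigmoid(w_gate . x) > threshold."""
    states = [list(t) for t in tokens]  # copy so the caller's input is not mutated
    frozen = [False] * len(states)
    exit_masks, confidences = [], []
    for layer in layers:
        row = []
        for i in range(len(states)):
            if not frozen[i]:
                states[i] = layer(states[i])  # frozen tokens skip the layer
            conf = sigmoid(sum(w * x for w, x in zip(w_gate, states[i])))
            row.append(conf)
            if conf > threshold:
                frozen[i] = True
        exit_masks.append(list(frozen))  # cumulative mask after this layer
        confidences.append(row)
    return states, exit_masks, confidences

def layers_used(exit_masks):
    """Layer 1 always runs; layer l + 1 runs only if the token is not frozen after layer l."""
    n_tokens = len(exit_masks[0])
    return [1 + sum(not mask[i] for mask in exit_masks[:-1]) for i in range(n_tokens)]

def toy_layer(x):
    """Stand-in for a transformer layer: shifts every feature by 1."""
    return [v + 1.0 for v in x]

states, exit_masks, confidences = run_with_exit_gate(
    tokens=[[1.0, 0.0], [-3.0, 0.0]],
    layers=[toy_layer] * 4,
    w_gate=[1.0, 0.0],
)
used = layers_used(exit_masks)
print(used)                   # [1, 4]
print(sum(used) / len(used))  # 2.5
```

The first token clears the 0.85 threshold after one layer and is frozen; the second never does and runs all four layers, so the average depth is 2.5 of 4.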

Quick Start

git clone https://github.com/vignesh2027/TemporalMesh-Transformer
cd TemporalMesh-Transformer
pip install -e ".[dev]"

from tmt.model.config import TMTConfig
from tmt.model.model import TMTModel
import torch

config = TMTConfig(
 vocab_size=50257,
 d_model=512,
 n_heads=8,
 n_layers=12,
 graph_k=8,
 exit_threshold=0.85,
 memory_anchors=16,
)
model = TMTModel(config) # ~120M params

tokens = torch.randint(0, 50257, (1, 256))
out = model(tokens)

print(out.logits.shape) # (1, 256, 50257)
print(out.exit_masks[-1]) # which tokens exited at layer 12
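# Average number of layers each token actually ran through.
# Assumption: exit_masks[i] is cumulative, i.e. True once a token has been frozen at or before layer i + 1.
still_active = torch.stack([~m for m in out.exit_masks[:-1]]).float() # (n_layers - 1, B, S)
avg_exit = (1 + still_active.sum(dim=0)).mean().item()
print(f"Avg exit layer: {avg_exit:.2f}") # ~5.8 reported for the trained model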

Training

python scripts/train.py \
 --dataset wikitext-2 \
 --model_size base \
 --steps 10000 \
 --lr 3e-4 \
 --batch_size 16 \
 --seq_len 256 \
 --exit_threshold 0.85 \
 --graph_k 8
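
The results note mentions a cosine learning-rate schedule. As a rough illustration of what that means for the flags above, the sketch below evaluates a plain cosine decay from the peak rate of 3e-4 to zero over 10,000 steps. Warmup length and the final learning rate are not stated in this README, so the absence of warmup and `min_lr=0.0` are assumptions, and `cosine_lr` is a hypothetical helper, not part of `scripts/train.py`.

```python
import math

def cosine_lr(step, total_steps=10_000, peak_lr=3e-4, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps (no warmup assumed)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, f"{cosine_lr(step):.2e}")
```

Under these assumptions the rate is half its peak at the midpoint (step 5,000) and reaches zero at the final step.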

Ablation Summary

Config	WikiText-2 PPL ↓	Compute (% of vanilla)	VRAM
Vanilla Transformer	42.1	100%	18.4 GB
+ Mesh Attention only	37.8	62%	11.2 GB
+ Temporal Decay only	40.3	98%	18.4 GB
+ Adaptive Exit only	39.6	51%	18.4 GB
Mesh + Decay	34.2	61%	11.2 GB
Mesh + Exit	35.1	50%	11.2 GB
Full TMT	29.4	48%	11.2 GB

The full combination is superadditive: its gain over the vanilla baseline exceeds the sum of the three single-component gains by 4.1 PPL.
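
The 4.1 PPL figure can be reproduced from the ablation table. The short check below treats the three single-component rows (Mesh Attention only, Temporal Decay only, Adaptive Exit only) as the individual contributions; the variable names are illustrative.

```python
# Perplexities copied from the ablation table above.
ppl = {"vanilla": 42.1, "mesh": 37.8, "decay": 40.3, "exit": 39.6, "full": 29.4}

# Gain of each single-component variant over the vanilla baseline.
gains = {name: ppl["vanilla"] - ppl[name] for name in ("mesh", "decay", "exit")}
additive = sum(gains.values())            # what a purely additive model would predict
full_gain = ppl["vanilla"] - ppl["full"]  # what Full TMT actually achieves
interaction = full_gain - additive

print({name: round(g, 1) for name, g in gains.items()})
print(round(additive, 1), round(full_gain, 1), round(interaction, 1))  # 8.6 12.7 4.1
```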

Citation

@misc{vigneshwar2026tmt,
 title = {TemporalMesh Transformer: Dynamic Graph Attention with
 Temporal Semantic Decay and Per-Token Adaptive Depth Routing},
 author = {Vigneshwar LK},
 year = {2026},
 doi = {10.5281/zenodo.20287197},
 url = {https://zenodo.org/records/20287390}
}

License

Evaluation results

Validation Perplexity on WikiText-2 (self-reported): 29.4
Validation Perplexity on WikiText-103 (self-reported): 36.1

URL: https://huggingface.co/vigneshwar234/TemporalMesh-Transformer
