Differentiable optimal transport rewrites a dense feed‑forward layer into a balanced mixture‑of‑experts without any hand‑crafted routing or expert sizing. The paper treats neuron assignment as a transport problem and solves it with Sinkhorn‑Knopp iterations, so the conversion is fully differentiable and end‑to‑end trainable.
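To make the transport framing concrete, here is a minimal sketch of Sinkhorn-Knopp scaling applied to a toy neuron-to-expert assignment. It is not the paper's implementation: the cost matrix, the regularization strength `epsilon`, the iteration count, and the function name `sinkhorn_plan` are all illustrative assumptions. It is written in plain Python for readability, so it is not differentiable as written; an autodiff framework would be needed to backpropagate through the iterations.

```python
import math


def sinkhorn_plan(cost, row_mass, col_mass, epsilon=0.5, n_iters=200):
    """Entropy-regularized transport plan via Sinkhorn-Knopp scaling.

    cost[i][j] is a hypothetical cost of placing neuron i in expert j.
    row_mass[i] is the mass each neuron carries; col_mass[j] is the
    capacity of each expert. The two must sum to the same total.
    """
    n, m = len(cost), len(cost[0])
    # Gibbs kernel: low cost -> high affinity.
    kernel = [[math.exp(-cost[i][j] / epsilon) for j in range(m)] for i in range(n)]
    u = [1.0] * n  # row scaling factors
    v = [1.0] * m  # column scaling factors
    for _ in range(n_iters):
        # Rescale rows so each neuron distributes exactly its own mass.
        for i in range(n):
            u[i] = row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m))
        # Rescale columns so each expert receives exactly its capacity.
        for j in range(m):
            v[j] = col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n))
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy setup (assumed): 6 neurons, 2 experts, each expert holds 3 neurons.
# Four neurons prefer expert 0, so the capacity constraint must push some
# of their mass toward expert 1.
toy_cost = [[0.2, 0.8]] * 4 + [[0.8, 0.2]] * 2
plan = sinkhorn_plan(toy_cost, row_mass=[1.0] * 6, col_mass=[3.0, 3.0])
for row in plan:
    print([round(p, 3) for p in row])
```

Each row of `plan` is a soft assignment of one neuron across the experts, and the column sums match the expert capacities. That is the sense in which a transport formulation can enforce balance by construction instead of through a hand-tuned heuristic.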
Before DOT‑MoE, turning a pretrained dense model into a sparse expert system required heuristic clustering of neurons or random splits, and training MoEs from scratch was notoriously unstable. Those approaches offered no principled way to guarantee expert capacity or to jointly learn token routing.
The paper reports DOT‑MoE “retaining 90% of the original dense model’s performance while reducing active parameters by 50%” across multiple architectures and benchmarks, which is the most compelling evidence that the transport‑based refactor preserves most of the quality at half the active‑parameter budget [1].
In addition, “DOT‑MoE achieves the lowest perplexity (7.99) among all existing methods, outperforming the state‑of‑the‑art DISP‑LLM (9.84) by a substantial margin,” suggesting that the OT formulation does more than match pruning baselines: it reaches lower perplexity than they do [1].
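As a rough illustration of where a 50% cut in active parameters can come from, the arithmetic below assumes a dense feed-forward layer split into equal-sized experts with top-k routing and a single linear router. The layer dimensions, expert count, and `top_k` value are hypothetical and are not taken from the paper.

```python
def dense_ffn_params(d_model, d_ff, n_matrices=2):
    """Weight count of a dense feed-forward block (biases ignored).

    n_matrices=2 assumes a plain up/down projection pair; gated variants
    would use 3.
    """
    return n_matrices * d_model * d_ff


def active_moe_params(d_model, d_ff, n_experts, top_k, n_matrices=2):
    """Weights touched per token when the same FFN is split into equal
    experts and only top_k of them fire, plus a linear router
    (d_model x n_experts). Equal expert sizes are an assumption."""
    expert_params = dense_ffn_params(d_model, d_ff, n_matrices) * top_k // n_experts
    router_params = d_model * n_experts
    return expert_params + router_params


# Hypothetical layer: d_model=4096, d_ff=16384, 8 experts, 4 active per token.
dense = dense_ffn_params(4096, 16384)
active = active_moe_params(4096, 16384, n_experts=8, top_k=4)
reduction = 1 - active / dense
print(f"dense={dense:,} active={active:,} reduction={reduction:.2%}")
```

Under these assumptions the router adds only a small overhead, so the active fraction is governed almost entirely by the ratio `top_k / n_experts`.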
The method still hinges on differentiable Sinkhorn iterations, which add overhead to the conversion pipeline and may limit scalability to extremely large models. Moreover, the evaluation focuses on feed‑forward networks; extending the transport‑based decomposition to attention heads or other architectural components remains unexplored.
If the reported gains hold broadly, the default workflow for model compression could shift from ad‑hoc expert design to a systematic DOT‑MoE conversion, letting engineers halve active parameters while staying within a 10% performance envelope. That would eliminate a major source of manual tuning and open a scalable path to sparse inference for large‑scale pretrained models.
References
