More aggressive truncation increases the variance of group-wise advantages, which destabilizes training and biases the advantage estimates. To mitigate this, we replace group-wise normalization with batch-wise normalization.
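The difference between the two normalization schemes can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the batch-wise variant shown here (center each reward on its group mean, then divide by one standard deviation computed over the whole batch) is an assumed formulation, since the text above does not spell out the formula.

```python
from statistics import mean, pstdev


def group_normalize(groups, eps=1e-6):
    """Normalize each prompt's rewards with that group's own mean and std."""
    out = []
    for rewards in groups:
        mu, sigma = mean(rewards), pstdev(rewards)
        out.append([(r - mu) / (sigma + eps) for r in rewards])
    return out


def batch_normalize(groups, eps=1e-6):
    """Assumed variant: center on the group mean, scale by the batch-wide std."""
    centered = [[r - mean(rewards) for r in rewards] for rewards in groups]
    flat = [a for group in centered for a in group]
    sigma = pstdev(flat)  # a single scale shared by every group in the batch
    return [[a / (sigma + eps) for a in group] for group in centered]


# Two prompts with four sampled responses each (1.0 = correct, 0.0 = wrong).
groups = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
group_adv = group_normalize(groups)
batch_adv = batch_normalize(groups)
```

With group-wise scaling, each small group divides by its own noisy standard deviation. In the batch-wise sketch all groups share one scale estimated from many more samples.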
Clipping the updates of low-probability, high-entropy tokens, which are essential for exploring diverse reasoning paths, can cause an entropy collapse that limits exploration. Increasing the clipping threshold preserves these tokens during gradient updates, thereby alleviating entropy collapse.
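To make the effect of the threshold concrete, the following sketch evaluates a PPO-style clipped surrogate for a single token. It assumes that the threshold being raised is the upper bound on the probability ratio. The function names and the values 0.2 and 0.28 are illustrative assumptions, not settings taken from the text above.

```python
def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate for one token with separate clip bounds."""
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped_ratio * advantage)


def is_clipped(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """True when the clipped branch is active, so the token gets no gradient."""
    return (advantage > 0 and ratio > 1.0 + eps_high) or (
        advantage < 0 and ratio < 1.0 - eps_low
    )


# A token whose probability rose by 25% under a positive advantage.
ratio, advantage = 1.25, 1.0
dropped_with_tight_clip = is_clipped(ratio, advantage, eps_high=0.2)
dropped_with_loose_clip = is_clipped(ratio, advantage, eps_high=0.28)
```

In this example the tighter bound removes the token from the gradient update, while the looser bound keeps it. This is the mechanism the paragraph above describes for rarely sampled tokens.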
Applying a length penalty makes early training too difficult and leaves later stages dominated by easy samples. We adopt Dynamic Sampling, which filters out overly easy or hard prompts and extreme response lengths, yielding balanced training signals and better length control.
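A filtering step of this kind might look like the sketch below. The exact criteria are assumptions made for illustration: a prompt is treated as too easy or too hard when all of its sampled responses receive the same correctness reward, and as having extreme lengths when more than a hypothetical fraction `max_truncated_frac` of its responses reach the length limit. The function names and the dictionary layout are likewise hypothetical.

```python
def has_training_signal(rewards, lengths, max_len, max_truncated_frac=0.5):
    """Decide whether one prompt's group of responses is worth training on."""
    accuracy = sum(rewards) / len(rewards)
    if accuracy == 0.0 or accuracy == 1.0:
        return False  # every response wrong (too hard) or right (too easy)
    truncated_frac = sum(n >= max_len for n in lengths) / len(lengths)
    return truncated_frac <= max_truncated_frac  # drop mostly truncated groups


def dynamic_sample(batch, max_len, max_truncated_frac=0.5):
    """Keep only the prompt groups that pass the filter above."""
    return [
        group
        for group in batch
        if has_training_signal(
            group["rewards"], group["lengths"], max_len, max_truncated_frac
        )
    ]


batch = [
    {"rewards": [1, 1, 1, 1], "lengths": [10, 12, 9, 11]},      # too easy
    {"rewards": [0, 0, 0, 0], "lengths": [30, 41, 28, 35]},     # too hard
    {"rewards": [1, 0, 1, 0], "lengths": [10, 20, 15, 12]},     # kept
    {"rewards": [1, 0, 0, 0], "lengths": [100, 100, 100, 40]},  # mostly truncated
]
kept = dynamic_sample(batch, max_len=100)
```

In a training loop, sampling would typically continue until enough groups pass the filter to fill the batch. That refill step is omitted here.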
We unify batch-wise reward normalization, a higher policy update clipping threshold, dynamic sampling to remove instances lacking balanced training signals, and a simple length truncation penalty into a comprehensive training recipe, which we term DLER (Doing Length pEnalty Right).
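As a rough illustration of the last ingredient, a truncation penalty can be as simple as withholding reward from any response that exceeds the length budget. The sketch below assumes a binary correctness reward and a hard cutoff. The function name and the reward values are hypothetical, since the text does not give the exact form.

```python
def truncation_reward(is_correct, num_tokens, max_tokens):
    """Binary correctness reward that is zeroed when the response is too long."""
    if num_tokens > max_tokens:
        return 0.0  # over budget: no credit, even if the answer is right
    return 1.0 if is_correct else 0.0


rewards = [
    truncation_reward(True, 800, max_tokens=1000),   # correct and within budget
    truncation_reward(True, 1200, max_tokens=1000),  # correct but over budget
    truncation_reward(False, 300, max_tokens=1000),  # wrong and within budget
]
```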
We show that with DLER, the effect of adopting different length-penalty rewards fundamentally changes. Specifically, changing the reward no longer yields strictly shorter responses with higher accuracy; instead, a trade-off between accuracy and length always exists.
@misc{liu2025dlerdoinglengthpenalty,
title={DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning},
author={Shih-Yang Liu and Xin Dong and Ximing Lu and Shizhe Diao and Mingjie Liu and Min-Hung Chen and Hongxu Yin and Yu-Chiang Frank Wang and Kwang-Ting Cheng and Yejin Choi and Jan Kautz and Pavlo Molchanov},
year={2025},
eprint={2510.15110},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.15110},
}