Reinforcement Learning

Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Fanfan Liu, Youyang Yin, Peng Shi, Siqi Yang, Zhixiong Zeng, Haibo Qiu
Published
February 5, 2026
Authors: 6
Word count: 4,843

LUSPO removes the length bias in RLVR loss functions, preventing response-length collapse and improving reasoning performance.

Abstract

Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.

Key Takeaways

  1. Existing RLVR algorithms suffer from length bias.

  2. LUSPO neutralizes length bias by scaling the sequence loss by response length.

  3. LUSPO enables models to generate longer, more detailed responses.
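The length-scaling idea in the second takeaway can be illustrated with a toy sketch. The code below assumes a GSPO-style, length-normalized sequence importance ratio and a simplified surrogate loss (clipping and grouping omitted); `luspo_loss` is a hypothetical reading of the paper's fix in which the per-sequence loss is multiplied by response length, so the per-token gradient scale no longer shrinks as responses grow. Names and details are illustrative, not the authors' implementation.

```python
import math

def seq_ratio(logp_new, logp_old):
    # GSPO-style length-normalized sequence importance ratio:
    # s = exp((log pi_new(y|x) - log pi_old(y|x)) / |y|)
    n = len(logp_new)
    return math.exp((sum(logp_new) - sum(logp_old)) / n)

def gspo_loss(logp_new, logp_old, adv):
    # Simplified per-sequence surrogate loss (clipping omitted).
    return -adv * seq_ratio(logp_new, logp_old)

def luspo_loss(logp_new, logp_old, adv):
    # Hypothetical LUSPO-style variant: rescale by |y| so the total
    # loss, and hence the per-token gradient, tracks response length.
    return len(logp_new) * gspo_loss(logp_new, logp_old, adv)

# Two responses with identical per-token log-prob shift, different lengths.
short_new, short_old = [-1.0] * 10, [-1.1] * 10
long_new, long_old = [-1.0] * 40, [-1.1] * 40

# GSPO: same total loss for both, so longer responses get weaker
# per-token updates; LUSPO: total loss grows 4x with a 4x longer response.
print(gspo_loss(short_new, short_old, 1.0), gspo_loss(long_new, long_old, 1.0))
print(luspo_loss(short_new, short_old, 1.0), luspo_loss(long_new, long_old, 1.0))
```

Under this toy setup, the length-normalized loss treats a 10-token and a 40-token response identically, which is one way a per-token optimization pressure against length can emerge.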

Limitations

  • LUSPO's effectiveness may vary across different tasks.

  • Computational cost may increase due to length-based scaling.

Keywords

Reinforcement Learning with Verifiable Rewards, LLMs, Vision-Language Models, response length, sequence policy optimization, Group Sequence Policy Optimization, Length-Unbiased Sequence Policy Optimization, mathematical reasoning, multimodal reasoning
