Reinforcement Learning

Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards

Kirill Pavlenko, Alexander Golubev, Simon Karasik, Boris Yangel
Published: February 10, 2026
Word count: 9,328
Includes code

Blockwise advantage estimation solves multi-objective RL interference by decoupling rewards across output sections.

Abstract

Group Relative Policy Optimization (GRPO) assigns a single scalar advantage to all tokens in a completion. For structured generations with explicit segments and objectives, this couples unrelated reward signals across segments, leading to objective interference and misattributed credit. We propose Blockwise Advantage Estimation, a family of GRPO-compatible methods that assigns each objective its own advantage and applies it only to the tokens in the corresponding text block, reducing reliance on hand-designed scalar rewards and scaling naturally to additional objectives. A key challenge is estimating advantages for later blocks whose rewards are conditioned on sampled prefixes; standard unbiased approaches require expensive nested rollouts from intermediate states. Concretely, we introduce an Outcome-Conditioned Baseline that approximates intermediate state values using only within-group statistics by stratifying samples according to a prefix-derived intermediate outcome. On math tasks with uncertainty estimation, our method mitigates reward interference, is competitive with a state-of-the-art reward-designed approach, and preserves test-time gains from confidence-weighted ensembling. More broadly, it provides a modular recipe for optimizing sequential objectives in structured generations without additional rollouts.
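The core contrast the abstract draws can be sketched numerically. The snippet below is an illustrative toy, not the paper's implementation: it assumes a group of completions, each split into two blocks with one verifiable reward per objective, and compares GRPO's single group-normalised scalar (broadcast to every token) against a blockwise scheme where each objective's group-normalised advantage touches only its own block's tokens.

```python
import numpy as np

G, T = 4, 10                               # group size, tokens per completion
block_of_token = np.array([0] * 6 + [1] * 4)   # block id of each token position
rewards = np.array([                       # per-completion reward per objective
    [1.0, 0.2],
    [0.0, 0.8],
    [1.0, 0.9],
    [0.0, 0.1],
])

# GRPO: sum the objectives into one scalar, normalise within the group,
# and assign the same advantage to all T tokens of a completion.
total = rewards.sum(axis=1)
grpo_adv = (total - total.mean()) / (total.std() + 1e-8)
grpo_tok = np.repeat(grpo_adv[:, None], T, axis=1)

# Blockwise: normalise each objective's rewards separately within the group,
# then route each advantage only to the tokens of the matching block.
block_adv = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
bae_tok = block_adv[np.arange(G)[:, None], block_of_token[None, :]]
```

Under GRPO, a completion with a strong block-1 reward and a weak block-2 reward pushes one blended signal onto every token; under the blockwise variant, the two blocks receive independent credit.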

Key Takeaways

  1. Blockwise Advantage Estimation decouples multi-objective rewards by computing separate advantages for different text blocks instead of one scalar.

  2. GRPO's monolithic approach causes objective interference when different output sections serve different purposes in language model tasks.

  3. BAE uses efficient baseline estimation for sequential blocks without expensive Monte Carlo rollouts, maintaining GRPO's computational efficiency.
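The third takeaway, read together with the abstract's Outcome-Conditioned Baseline, suggests a simple mechanic: stratify the group by a prefix-derived intermediate outcome and centre the later block's reward against the mean of its own stratum. The function below is a hedged sketch of that idea under those assumptions; the name, the binary outcome, and the mean-only baseline are all illustrative, not the paper's exact estimator.

```python
import numpy as np

def outcome_conditioned_advantage(block2_rewards, prefix_outcome):
    """Centre each later-block reward against the mean reward of group
    members sharing the same prefix-derived outcome (e.g. whether the
    first block's answer was verified correct)."""
    rewards = np.asarray(block2_rewards, dtype=float)
    outcome = np.asarray(prefix_outcome)
    adv = np.empty_like(rewards)
    for o in np.unique(outcome):
        mask = outcome == o
        adv[mask] = rewards[mask] - rewards[mask].mean()  # within-stratum baseline
    return adv

# Group of 4: completions 0 and 2 solved block 1 (outcome 1), 1 and 3 did not.
adv = outcome_conditioned_advantage([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0])
```

Because each stratum supplies its own baseline, a later-block reward is judged only against completions whose prefixes led to the same intermediate outcome, approximating an intermediate-state value without extra rollouts.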

Limitations

  • The method requires clear structural decomposition of tasks into distinct blocks, limiting applicability to unstructured outputs.

  • Baseline estimation for later blocks remains approximate due to computational constraints of language model generation.

Keywords

Group Relative Policy Optimization, advantage estimation, reward interference, structured generations, text blocks, outcome-conditioned baseline, nested rollouts, confidence-weighted ensembling
