Your Group-Relative Advantage Is Biased
Fengkai Yang, Zherui Chen, and 11 other authors
Reinforcement Learning from Verifier Rewards (RLVR) has emerged as a widely used approach for post-training large language models on reasoning tasks, with group-based methods such as GRPO and its vari...