Alleviating Sparse Rewards by Modeling Step-Wise and Long-Term Sampling Effects in Flow-Based GRPO

Yunze Tong, Mushui Liu, Canyu Zhao, Wanggui He, Shiyi Zhang, Hongwei Zhang, Peng Zhang, Jinlong Liu, Ju Huang, Jiamang Wang, Hao Jiang, Pipei Huang
Published: February 6, 2026
Authors: 12
Word count: 8,099
Code: included

Enhanced GRPO for better text-to-image generation.

Abstract

Deploying GRPO on Flow Matching models has proven effective for text-to-image generation. However, existing paradigms typically propagate an outcome-based reward to all preceding denoising steps without distinguishing the local effect of each step. Moreover, current group-wise ranking mainly compares trajectories at matched timesteps and ignores within-trajectory dependencies, where certain early denoising actions can affect later states via delayed, implicit interactions. We propose TurningPoint-GRPO (TP-GRPO), a GRPO framework that alleviates step-wise reward sparsity and explicitly models long-term effects within the denoising trajectory. TP-GRPO makes two key innovations: (i) it replaces outcome-based rewards with step-level incremental rewards, providing a dense, step-aware learning signal that better isolates each denoising action's "pure" effect, and (ii) it identifies turning points (steps that flip the local reward trend and make subsequent reward evolution consistent with the overall trajectory trend) and assigns these actions an aggregated long-term reward to capture their delayed impact. Turning points are detected solely via sign changes in incremental rewards, making TP-GRPO efficient and hyperparameter-free. Extensive experiments demonstrate that TP-GRPO exploits reward signals more effectively and consistently improves generation. Demo code is available at https://github.com/YunzeTong/TurningPoint-GRPO.
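The abstract states that turning points are detected solely via sign changes in incremental rewards. A minimal sketch of that idea follows; the function names, reward values, and the assumption that a scalar reward is available at every denoising step are illustrative, not taken from the authors' released code.

```python
# Hypothetical sketch of turning-point detection via sign changes in
# incremental rewards, based on the description in the abstract.

def incremental_rewards(step_rewards):
    """Step-level incremental rewards: r_t = R_t - R_{t-1}."""
    return [b - a for a, b in zip(step_rewards, step_rewards[1:])]

def turning_points(step_rewards):
    """Indices of steps whose incremental reward flips sign relative
    to the previous step. No thresholds: hyperparameter-free."""
    inc = incremental_rewards(step_rewards)
    return [t for t in range(1, len(inc)) if inc[t] * inc[t - 1] < 0]

# Example with hypothetical rewards along a denoising trajectory.
rewards = [0.10, 0.05, 0.02, 0.08, 0.15, 0.12, 0.20]
print(turning_points(rewards))  # → [2, 4, 5]
```

In this sketch a turning point is any step where the local reward trend reverses; per the paper, such steps would then receive an aggregated long-term reward rather than only their local incremental reward.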

Key Takeaways

  1. Improved reward alignment for each denoising step.
  2. Identification of turning points for long-term impact.
  3. Enhanced fine-tuning efficiency in text-to-image generation.

Limitations

  • Requires additional computation for ODE sampling.

  • Complexity in identifying and handling turning points.

Keywords

GRPO, flow matching models, text-to-image generation, denoising steps, reward sparsity, incremental rewards, turning points, denoising trajectory, delayed impact, reward evolution
