Multi-Task GRPO: Reliable LLM Reasoning Across Tasks

Shyam Sundhar Ramesh, Xiaotong Ji, Matthieu Zimmer, Sangwoong Yoon, Zhiyong Wang, Haitham Bou Ammar, Aurelien Lucchi, Ilija Bogunovic
Published: February 5, 2026
Authors: 8
Word count: 11,420
Code: included

MT-GRPO ensures balanced LLM performance across tasks.

Abstract

RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
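The abstract's first mechanism, dynamically adapting task weights toward worst-task performance, can be sketched as a softmax over (negated) per-task improvement, so that tasks making the least progress receive the largest weight. This is a minimal illustration, not the paper's actual update rule; the function name, the improvement signal, and the temperature parameter are assumptions for the sketch.

```python
import math

def adapt_task_weights(task_rewards, prev_rewards, temperature=1.0):
    """Hypothetical improvement-aware reweighting: tasks with the
    least improvement get larger weights, steering optimization
    toward the worst-performing task."""
    # Per-task improvement since the previous evaluation.
    improvements = [r - p for r, p in zip(task_rewards, prev_rewards)]
    # Less improvement -> larger logit -> larger weight.
    logits = [-imp / temperature for imp in improvements]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

weights = adapt_task_weights([0.8, 0.4, 0.6], [0.7, 0.4, 0.5])
# The stagnating second task (zero improvement) receives the largest weight.
```

Under this sketch, a stagnating task's weight grows until its policy gradients dominate the update, which is the balancing behavior the abstract describes.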

Key Takeaways

  1. MT-GRPO balances LLM performance across multiple tasks.

  2. Uses improvement-aware task reweighting and a ratio-preserving sampler.

  3. Enhances the reliability of LLMs in diverse real-world applications.
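The second takeaway, the ratio-preserving sampler, addresses the abstract's observation that tasks differ in how often prompts yield zero advantages (and thus zero gradients). One plausible reading, sketched below under that assumption, is to oversample tasks with high zero-advantage rates so that the surviving non-zero gradients match the adapted task weights; the function and its arguments are hypothetical.

```python
def ratio_preserving_counts(weights, nonzero_rates, batch_size):
    """Hypothetical ratio-preserving sampler: allocate per-task prompt
    counts so that the expected number of non-zero-advantage prompts
    per task is proportional to the adapted task weights."""
    # A task whose prompts often produce zero advantage (e.g. groups
    # that are all-correct or all-wrong) needs proportionally more
    # sampled prompts to contribute the same gradient mass.
    raw = [w / r for w, r in zip(weights, nonzero_rates)]
    z = sum(raw)
    return [round(batch_size * x / z) for x in raw]

# Two equally weighted tasks; the first yields useful gradients on only
# half of its prompts, so it is sampled twice as often.
counts = ratio_preserving_counts([0.5, 0.5], [0.5, 1.0], 96)
# counts == [64, 32]; expected non-zero prompts: 64*0.5 == 32*1.0
```

Without such a correction, a naive uniform sampler would let tasks with low zero-advantage rates contribute disproportionately to the optimization signal, which is the distortion the abstract highlights.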

Limitations

  • Requires careful hyperparameter tuning for optimal performance.

  • May not fully address extreme task imbalances.

Keywords

GRPO, multi-task adaptation, worst-task performance, policy gradients, ratio-preserving sampler, task weights, optimization signal, training efficiency
