Reinforcement Learning

ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han, Chenyi Tong, Haoran Deng, Renliang Sun, Alexander Taylor, Yanqiao Zhu, Jason Cong, Yizhou Sun, Wei Wang
Published: February 25, 2026
Authors: 14
Word Count: 2,627
Code: Includes code

ARLArena provides a systematic framework for stable agentic reinforcement learning through SAMPO optimization.

Abstract

Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable and often suffers from training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled, reproducible setting. ARLArena constructs a clean, standardized testbed, then decomposes the policy gradient into four core design dimensions and assesses the performance and stability of each. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective for ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines.

Key Takeaways

  1. Agentic reinforcement learning training collapses due to distribution shift and cascading errors in multi-turn tasks.

  2. The ARLArena framework systematically analyzes four orthogonal policy gradient dimensions to identify stability factors.

  3. The SAMPO method combines behavior cloning, format penalties, and KL regularization for stable agent training.
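The takeaways describe SAMPO as combining a policy gradient objective with behavior cloning, format penalties, and KL regularization. The paper's exact formulation is not given here, so the following is a minimal illustrative sketch in plain Python: a PPO-style clipped surrogate plus the three auxiliary terms. All coefficient values, the clipping range, and the specific form of each term are assumptions for illustration, not the paper's specification.

```python
import math

def sampo_loss(logp_new, logp_old, logp_ref, advantages,
               bc_logp, format_ok,
               bc_coef=0.1, fmt_coef=0.05, kl_coef=0.01, clip=0.2):
    """Hypothetical composite objective (illustrative only):
    clipped policy gradient + behavior cloning + format penalty + KL term."""
    n = len(advantages)
    pg = bc = fmt = kl = 0.0
    for ln, lo, lr, a, b, ok in zip(logp_new, logp_old, logp_ref,
                                    advantages, bc_logp, format_ok):
        ratio = math.exp(ln - lo)                      # importance ratio
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        pg += -min(ratio * a, clipped * a)             # PPO-style clipped surrogate
        bc += -b                                       # behavior-cloning log-likelihood
        fmt += 0.0 if ok else 1.0                      # penalize malformed agent outputs
        kl += ln - lr                                  # rough KL estimate vs. reference policy
    return (pg + bc_coef * bc + fmt_coef * fmt + kl_coef * kl) / n
```

The design intent is that the auxiliary terms each target one failure mode: behavior cloning anchors the policy against distribution shift, the format penalty suppresses cascading tool-call errors, and the KL term limits drift from the reference model.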

Limitations

  • Previous approaches relied on ad-hoc patches without systematic analysis of which design dimensions actually matter.

  • Tolerant clipping provides fast initial gains but causes training collapse in later stages.
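The tolerant-clipping limitation can be sketched concretely. The exact clipping scheme the paper evaluates is not specified here, so the functions below are an assumed illustration: "tolerant" clipping is modeled as a looser upper bound on the importance ratio, which lets larger positive updates through (fast early gains) at the cost of the runaway updates associated with later collapse.

```python
def clipped_term(ratio, adv, low=0.8, high=1.2):
    """Standard symmetric PPO clipped surrogate term."""
    return min(ratio * adv, max(low, min(high, ratio)) * adv)

def tolerant_clipped_term(ratio, adv, low=0.8, high=1.5):
    """Illustrative 'tolerant' variant: the wider upper bound admits
    larger positive updates, the behavior linked to late-stage collapse."""
    return min(ratio * adv, max(low, min(high, ratio)) * adv)
```

For a ratio of 1.4 with positive advantage, the symmetric term caps the update at 1.2 while the tolerant term passes 1.4 through unclipped, illustrating why the tolerant variant moves faster initially but is less constrained.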

Keywords

agentic reinforcement learning, policy gradient, training stability, policy optimization, ARLArena, SAMPO
