Reinforcement Learning

ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning

Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li, Kaiqiao Han, Chenyi Tong, Haoran Deng, Renliang Sun, Alexander Taylor, Yanqiao Zhu, Jason Cong, Yizhou Sun, Wei Wang
Published: February 25, 2026
Authors: 14
Word Count: 2,627
Code: Includes code

ARLArena provides a systematic framework for stable agentic reinforcement learning through SAMPO optimization.

Abstract

Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks. Despite encouraging early results, ARL remains highly unstable and often suffers from training collapse. This instability limits scalability to larger environments and longer interaction horizons, and constrains systematic exploration of algorithmic design choices. In this paper, we propose ARLArena, a stable training recipe and systematic analysis framework that examines training stability in a controlled, reproducible setting. ARLArena constructs a clean, standardized testbed, then decomposes the policy gradient into four core design dimensions and assesses the performance and stability of each. Through this fine-grained analysis, we distill a unified perspective on ARL and propose SAMPO, a stable agentic policy optimization method designed to mitigate the dominant sources of instability in ARL. Empirically, SAMPO achieves consistently stable training and strong performance across diverse agentic tasks. Overall, this study provides a unifying policy gradient perspective for ARL and offers practical guidance for building stable and reproducible LLM-based agent training pipelines.

Key Takeaways

  1. Agentic reinforcement learning training collapses due to distribution shift and cascading errors in multi-turn tasks.

  2. The ARLArena framework systematically analyzes four orthogonal policy gradient dimensions to identify stability factors.

  3. The SAMPO method combines behavior cloning, format penalties, and KL regularization for stable agent training.
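The takeaways describe SAMPO as combining a policy gradient objective with behavior cloning, format penalties, and KL regularization. The paper's exact formulation is not given here, so the following is a minimal illustrative sketch in plain Python: a PPO-style clipped surrogate plus the three auxiliary terms. All coefficient values, the clipping range, and the specific form of each term are assumptions for illustration, not the paper's specification.

```python
import math

def sampo_loss(logp_new, logp_old, logp_ref, advantages,
               bc_logp, format_ok,
               bc_coef=0.1, fmt_coef=0.05, kl_coef=0.01, clip=0.2):
    """Hypothetical composite objective (illustrative only):
    clipped policy gradient + behavior cloning + format penalty + KL term."""
    n = len(advantages)
    pg = bc = fmt = kl = 0.0
    for ln, lo, lr, a, b, ok in zip(logp_new, logp_old, logp_ref,
                                    advantages, bc_logp, format_ok):
        ratio = math.exp(ln - lo)                      # importance ratio
        clipped = max(1.0 - clip, min(1.0 + clip, ratio))
        pg += -min(ratio * a, clipped * a)             # PPO-style clipped surrogate
        bc += -b                                       # behavior-cloning log-likelihood
        fmt += 0.0 if ok else 1.0                      # penalize malformed agent outputs
        kl += ln - lr                                  # rough KL estimate vs. reference policy
    return (pg + bc_coef * bc + fmt_coef * fmt + kl_coef * kl) / n
```

The design intent is that the auxiliary terms each target one failure mode: behavior cloning anchors the policy against distribution shift, the format penalty suppresses cascading tool-call errors, and the KL term limits drift from the reference model.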

Limitations

  • Previous approaches relied on ad-hoc patches without systematic analysis of which design dimensions actually matter.

  • Tolerant clipping provides fast initial gains but causes training collapse in later stages.
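The tolerant-clipping limitation can be sketched concretely. The exact clipping scheme the paper evaluates is not specified here, so the functions below are an assumed illustration: "tolerant" clipping is modeled as a looser upper bound on the importance ratio, which lets larger positive updates through (fast early gains) at the cost of the runaway updates associated with later collapse.

```python
def clipped_term(ratio, adv, low=0.8, high=1.2):
    """Standard symmetric PPO clipped surrogate term."""
    return min(ratio * adv, max(low, min(high, ratio)) * adv)

def tolerant_clipped_term(ratio, adv, low=0.8, high=1.5):
    """Illustrative 'tolerant' variant: the wider upper bound admits
    larger positive updates, the behavior linked to late-stage collapse."""
    return min(ratio * adv, max(low, min(high, ratio)) * adv)
```

For a ratio of 1.4 with positive advantage, the symmetric term caps the update at 1.2 while the tolerant term passes 1.4 through unclipped, illustrating why the tolerant variant moves faster initially but is less constrained.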

Keywords

agentic reinforcement learning, policy gradient, training stability, policy optimization, ARLArena, SAMPO
