Reinforcement Learning

Language-based Trial and Error Falls Behind in the Era of Experience

Haoyu Wang, Guozheng Ma, Shugang Cui, Yilun Kong, Haotian Luo, Li Shen, Mengya Gao, Yichao Wu, Xiaogang Wang, Dacheng Tao
Published: January 29, 2026
Authors: 10
Word count: 11,252
Code: included

SCOUT framework enhances LLM performance on new tasks.

Abstract

While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, nonlinguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to the mismatch between the pretraining distribution and the testing distribution. In this work, we demonstrate that the primary bottleneck is the prohibitive cost of exploration: mastering these tasks requires extensive trial and error, which is computationally unsustainable for parameter-heavy LLMs operating in a high-dimensional semantic space. To address this, we propose SCOUT (Sub-Scale Collaboration On Unseen Tasks), a novel framework that decouples exploration from exploitation. We employ lightweight "scouts" (e.g., small MLPs) to probe environmental dynamics at a speed and scale far exceeding LLMs. The collected trajectories are used to bootstrap the LLM via Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) to activate its latent world knowledge. Empirically, SCOUT enables a Qwen2.5-3B-Instruct model to achieve an average score of 0.86, significantly outperforming proprietary models, including Gemini-2.5-Pro (0.60), while reducing GPU-hour consumption by about 60%.
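The exploration phase described above can be sketched in miniature. The snippet below is a hypothetical illustration, not the paper's implementation: a trivially cheap "scout" policy (here, random search over a toy 1-D task standing in for the small MLPs) probes the environment at scale, and only its successful trajectories are kept as (state, action) pairs that would seed the LLM's SFT stage. The environment, function names, and hyperparameters are all assumptions for illustration.

```python
# Hypothetical sketch of SCOUT-style exploration: a lightweight scout
# collects trajectories cheaply; successful ones would later bootstrap
# the LLM via SFT. Toy environment and all names are illustrative.
import random


def rollout(policy, start=0, goal=5, max_steps=20):
    """Run one episode in a 1-D line world; reward 1.0 if the goal is reached."""
    state, traj = start, []
    for _ in range(max_steps):
        action = policy(state)  # step of +1 or -1
        traj.append((state, action))
        state = state + action
        if state == goal:
            return traj, 1.0
    return traj, 0.0


def scout_explore(n_episodes=200, seed=0):
    """Cheap trial and error: a random scout probes the dynamics at scale."""
    rng = random.Random(seed)
    policy = lambda s: rng.choice([-1, 1])
    successes = []
    for _ in range(n_episodes):
        traj, reward = rollout(policy)
        if reward > 0:  # keep only trajectories that solved the task
            successes.append(traj)
    return successes


# Successful trajectories become (state -> action) pairs for SFT bootstrapping.
sft_pairs = [(s, a) for traj in scout_explore() for (s, a) in traj]
```

The key design point the paper argues for is visible even in this toy: the expensive model never pays for the failed rollouts; it only consumes the filtered experience.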

Key Takeaways

  1. SCOUT framework improves LLM efficiency on unseen tasks.

  2. Uses lightweight scouts for exploration, reducing computational cost.

  3. Outperforms baseline models and enables multi-task learning.

Limitations

  • Tested on models up to 3B parameters.

  • Requires high-end GPUs for fine-tuning and RL stages.

Keywords

Large Language Models, agentic tasks, pretraining distribution, testing distribution, exploration cost, trial-and-error, parameter-heavy LLMs, high-dimensional semantic space, lightweight scouts, MLPs, environmental dynamics, Supervised Fine-Tuning, reinforcement learning, Qwen2.5-3B-Instruct, Gemini-2.5-Pro
